Abstract

We give a general unified method that can be used for $L_1$ closeness testing of a wide range
of univariate structured distribution families. More specifically, we design a sample optimal
and computationally efficient algorithm for testing the equivalence of two unknown (potentially
arbitrary) univariate distributions under the $\mathcal{A}_k$-distance metric: Given sample access to
distributions with density functions $p, q : I \to \mathbb{R}$, we want to distinguish between the cases
that $p = q$ and $\|p - q\|_{\mathcal{A}_k} \ge \epsilon$ with probability at least 2/3. We show that for any
$k \ge 2$, $\epsilon > 0$, the optimal sample complexity of the $\mathcal{A}_k$-closeness testing problem is
$\Theta(\max\{k^{4/5}/\epsilon^{6/5}, k^{1/2}/\epsilon^2\})$. This is the first $o(k)$ sample algorithm for this
problem, and yields new, simple $L_1$ closeness testers, in most cases with optimal sample
complexity, for broad classes of structured distributions.
1 Introduction


We study the problem of closeness testing (equivalence testing) between two unknown probability
distributions. Given independent samples from a pair of distributions $p, q$, we want to determine
whether the two distributions are the same versus significantly different. This is a classical problem
in statistical hypothesis testing [NP33, LR05] that has received considerable attention by the TCS
community in the framework of property testing [RS96, GGR98]: given sample access to
distributions $p, q$, and a parameter $\epsilon > 0$, we want to distinguish between the cases that $p$ and
$q$ are identical versus $\epsilon$-far from each other in $L_1$ norm (statistical distance). Previous work on
this problem focused on characterizing the sample size needed to test the identity of two arbitrary
distributions of a given support size [BFR+00, CDVV14]. It is now known that the optimal sample
complexity (and running time) of this problem for distributions with support of size $n$ is
$\Theta(\max\{n^{2/3}/\epsilon^{4/3}, n^{1/2}/\epsilon^2\})$.

The aforementioned sample complexity characterizes worst-case instances, and one might hope 
that drastically better results can be obtained for most natural settings, in particular when the 
underlying distributions are known a priori to have some “nice structure”. In this work, we focus 
on the problem of testing closeness for structured distributions. Let $\mathcal{C}$ be a family of univariate
distributions. The problem of closeness testing for $\mathcal{C}$ is the following: Given sample access to
two unknown distributions $p, q \in \mathcal{C}$, we want to distinguish between the case that $p = q$ versus
$\|p - q\|_1 \ge \epsilon$. Note that the sample complexity of this testing problem depends on the underlying
class $\mathcal{C}$, and we are interested in obtaining efficient algorithms that are sample optimal for $\mathcal{C}$.

We give a general algorithm that can be used for $L_1$ closeness testing of a wide range of
structured distribution families. More specifically, we give a sample optimal and computationally
efficient algorithm for testing the identity of two unknown (potentially arbitrary) distributions $p, q$
under a different metric between distributions, the so-called $\mathcal{A}_k$-distance (see Section 2 for a
formal definition). Here, $k$ is a positive integer that intuitively captures the number of “crossings”
between the probability density functions $p, q$.

Our main result (see Theorem 1) says the following: For any $k \in \mathbb{Z}_+$, $\epsilon > 0$, and sample
access to arbitrary univariate distributions $p, q$, there exists a closeness testing algorithm under
the $\mathcal{A}_k$-distance using $O(\max\{k^{4/5}/\epsilon^{6/5}, k^{1/2}/\epsilon^2\})$ samples. Moreover, this bound is
information-theoretically optimal. We remark that our $\mathcal{A}_k$-testing algorithm applies to any pair of
univariate distributions (over both continuous and discrete domains). The main idea in using this
general algorithm for testing closeness of structured distributions in $L_1$ distance is this: if the
underlying distributions $p, q$ belong to a structured distribution family $\mathcal{C}$, we can use the
$\mathcal{A}_k$-distance as a proxy for the $L_1$ distance (for an appropriate value of the parameter $k$), and
thus obtain an $L_1$ closeness tester for $\mathcal{C}$.

We note that the $\mathcal{A}_k$-distance between distributions has been recently used to obtain sample
optimal efficient algorithms for learning structured distributions [CDSS14, ADLS15], and for testing
the identity of a structured distribution against an explicitly known distribution [DKN15] (e.g.,
uniformity testing). In both these settings, the sample complexity of the corresponding problem
(learning/identity testing) with respect to the $\mathcal{A}_k$-distance is identified with the sample complexity
of the problem under the $L_1$ distance for distributions of support $k$. More specifically, the sample
complexity of learning an unknown univariate distribution (over a continuous or discrete domain)
up to $\mathcal{A}_k$-distance $\epsilon$ is $\Theta(k/\epsilon^2)$ [CDSS14] (independent of the domain size), which is exactly
the sample complexity of learning a discrete distribution with support size $k$ up to $L_1$ error $\epsilon$.
Similarly, the sample complexity of uniformity testing of a univariate distribution (over a continuous
or discrete domain) up to $\mathcal{A}_k$-distance $\epsilon$ is $O(k^{1/2}/\epsilon^2)$ [DKN15] (again, independent of the
domain size), which is identical to the sample complexity of uniformity testing of a discrete
distribution with support size $k$ up to $L_1$ error $\epsilon$ [Pan08].
Rather surprisingly, this analogy is provably false for the closeness testing problem: we prove
that the sample complexity of the $\mathcal{A}_k$ closeness testing problem is
$\Theta(\max\{k^{4/5}/\epsilon^{6/5}, k^{1/2}/\epsilon^2\})$, while $L_1$ closeness testing between distributions of support
$k$ can be achieved with $O(\max\{k^{2/3}/\epsilon^{4/3}, k^{1/2}/\epsilon^2\})$ samples [CDVV14]. More specifically, our
upper bound for the $\mathcal{A}_k$ closeness testing problem applies for all univariate probability
distributions (both continuous and discrete). Our matching information-theoretic lower bound holds
for continuous distributions, or discrete distributions of support size $n$ sufficiently large as a
function of $k$, which is the most interesting regime for our applications.
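To get a feel for the gap between the two bounds, the following minimal sketch evaluates both expressions with all hidden constants dropped. The function names are illustrative and not from the paper; the numbers only indicate growth rates, not actual sample counts.

```python
def ak_closeness_bound(k: float, eps: float) -> float:
    """max{k^(4/5)/eps^(6/5), k^(1/2)/eps^2}, with constants dropped."""
    return max(k ** 0.8 / eps ** 1.2, k ** 0.5 / eps ** 2)


def l1_closeness_bound(k: float, eps: float) -> float:
    """max{k^(2/3)/eps^(4/3), k^(1/2)/eps^2}, with constants dropped."""
    return max(k ** (2 / 3) / eps ** (4 / 3), k ** 0.5 / eps ** 2)


if __name__ == "__main__":
    for k, eps in [(10**4, 0.5), (10**6, 0.25), (10**6, 0.001)]:
        a = ak_closeness_bound(k, eps)
        l = l1_closeness_bound(k, eps)
        print(f"k={k:>8} eps={eps:<6} A_k: {a:.1f}  L1 (support k): {l:.1f}")
```

For moderately large $\epsilon$ the first term dominates and the $\mathcal{A}_k$ bound sits strictly between the $L_1$ bound and $k$; for very small $\epsilon$ both expressions are governed by the shared $k^{1/2}/\epsilon^2$ term.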


1.1 Related and Prior Work. In this subsection we review the related literature and compare
our results with previous work.

Distribution Property Testing. Testing properties of distributions [BFR+00, BFR+13] has
developed into a mature research area within theoretical computer science. The paradigmatic problem
in this field is the following: given sample access to one or more unknown probability distributions,
determine whether they satisfy some global property or are “far” from satisfying the property. The
goal is to obtain an algorithm for this task that is both statistically and computationally efficient,
i.e., an algorithm with (information-theoretically) optimal sample size and polynomial runtime.
See [GR00, BFR+00, BFF+01, Bat01, BDKR02, BKR04, Pan08, Val11, VV11, DDS+13, ADJ+11,
LRR11, ILR12, CDVV14, VV14, DKN15] for a sample of works, and [Rub12] for a survey.

Shape Restricted Estimation. Statistical estimation under shape restrictions (i.e., inference
about a probability distribution under the constraint that its probability density function satisfies
certain qualitative properties) is a classical topic in statistics [BBBB72]. Various structural
restrictions have been studied in the literature, starting from monotonicity, unimodality, convexity,
and concavity [Gre56, Bru58, Rao69, Weg70, HP76, Gro85, Bir87a, Bir87b, Fou97, CT04, JW09],
and more recently focusing on structural restrictions such as log-concavity and $k$-monotonicity
[BW07, DR09, BRW09, GW09, BW10, KM10, Wal09, DW13, CS13, KS14, BD14, HW15]. The
reader is referred to [GJ14] for a recent book on the topic.

Comparison with Prior Work. Chan, Diakonikolas, Servedio, and Sun [CDSS14] proposed a
general approach to $L_1$-learn univariate probability distributions whose densities are well
approximated by piecewise polynomials. They designed an efficient agnostic learning algorithm for
piecewise polynomial distributions, and as a corollary obtained efficient learners for various families
of structured distributions. The approach of [CDSS14] uses the $\mathcal{A}_k$ distance metric between
distributions, but is otherwise orthogonal to ours. Batu et al. [BKR04] gave algorithms for closeness
testing between two monotone distributions with sample complexity $O(\log^3 n)$. Subsequently,
Daskalakis et al. [DDS+13] improved and generalized this result to $t$-modal distributions, obtaining
a closeness tester with sample complexity $O((t \log(n))^{2/3}/\epsilon^{8/3} + t^2/\epsilon^4)$. We remark that the
approach of [DDS+13] inherently yields an algorithm with sample complexity $\Omega(t)$, which is
sub-optimal.

The main ideas underlying this work are very different from those of [DDS+13] and [DKN15].
The approach of [DDS+13] involves constructing an adaptive interval decomposition of the domain
followed by an application of a (known) closeness tester to the “reduced” distributions over those
intervals. This approach incurs an extraneous term in the sample complexity, which is needed to
construct the appropriate decomposition. The approach of [DKN15] considers several oblivious
interval decompositions of the domain (i.e., without drawing any samples) and applies a “reduced”
identity tester for each such decomposition. This idea yields sample-optimal bounds for $\mathcal{A}_k$ identity
testing against a known distribution. However, it crucially exploits the knowledge of the explicit
distribution, and unfortunately fails in the setting where both distributions are unknown. We
elaborate on these points in Section 2.3.
2 Our Results and Techniques

2.1 Basic Definitions. We will use $p, q$ to denote the probability density functions (or probability
mass functions) of our distributions. If $p$ is discrete over support $[n] := \{1, \ldots, n\}$, we denote by
$p_i$ the probability of element $i$ in the distribution. For two discrete distributions $p, q$, their $L_1$ and
$L_2$ distances are $\|p - q\|_1 = \sum_{i=1}^n |p_i - q_i|$ and $\|p - q\|_2 = \sqrt{\sum_{i=1}^n (p_i - q_i)^2}$. For
$I \subseteq \mathbb{R}$ and density functions $p, q : I \to \mathbb{R}_+$, we have $\|p - q\|_1 = \int_I |p(x) - q(x)|\,dx$.

Fix a partition of the domain $I$ into disjoint intervals $\mathcal{I} := (I_i)_{i=1}^{\ell}$. For such a partition
$\mathcal{I}$, the reduced distribution $p_r^{\mathcal{I}}$ corresponding to $p$ and $\mathcal{I}$ is the discrete distribution
over $[\ell]$ that assigns the $i$-th “point” the mass that $p$ assigns to the interval $I_i$; i.e., for
$i \in [\ell]$, $p_r^{\mathcal{I}}(i) = p(I_i)$. Let $\mathfrak{J}_k$ be the collection of all partitions of the domain $I$ into
$k$ intervals. For $p, q : I \to \mathbb{R}_+$ and $k \in \mathbb{Z}_+$, we define the $\mathcal{A}_k$-distance between $p$ and $q$ by

$$\|p - q\|_{\mathcal{A}_k} := \max_{\mathcal{I} = (I_i)_{i=1}^k \in \mathfrak{J}_k} \sum_{i=1}^k |p(I_i) - q(I_i)| = \max_{\mathcal{I} \in \mathfrak{J}_k} \|p_r^{\mathcal{I}} - q_r^{\mathcal{I}}\|_1 .$$
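For discrete distributions over $[n]$ the definition above can be evaluated exactly by dynamic programming over interval endpoints. The sketch below is only an illustration of the definition, not part of the paper's algorithm (the tester never computes this quantity, since $p$ and $q$ are unknown). The function name `ak_distance` is hypothetical, and the code assumes that partitions consist of nonempty contiguous intervals, capping $k$ at $n$; this is harmless because refining a partition never decreases the sum.

```python
from itertools import accumulate


def ak_distance(p, q, k):
    """A_k distance between two probability vectors p, q over {1, ..., n}.

    best[i] holds the maximum of sum_j |p(I_j) - q(I_j)| over partitions of
    the first i points into the current number of nonempty intervals.
    Runs in O(k * n^2) time.
    """
    n = len(p)
    if n == 0 or n != len(q) or k < 1:
        raise ValueError("need equal-length nonempty vectors and k >= 1")
    k = min(k, n)  # more than n nonempty intervals is impossible
    # prefix[i] = sum of (p_j - q_j) over the first i points
    prefix = [0.0] + list(accumulate(pi - qi for pi, qi in zip(p, q)))
    neg_inf = float("-inf")
    best = [0.0] + [neg_inf] * n  # zero intervals cover zero points
    for _ in range(k):
        nxt = [neg_inf] * (n + 1)
        for i in range(1, n + 1):
            # the last interval covers points s+1 .. i
            nxt[i] = max(
                (best[s] + abs(prefix[i] - prefix[s])
                 for s in range(i) if best[s] != neg_inf),
                default=neg_inf,
            )
        best = nxt
    return best[n]
```

For example, with `p = [0.5, 0, 0.5, 0]` and `q = [0, 0.5, 0, 0.5]` the $L_1$ distance is 2, but a single interval sees no difference at all; the distance grows as $k$ increases and reaches the $L_1$ distance once $k = n$.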

2.2 Our Results. Our main result is an optimal algorithm and a matching information-theoretic
lower bound for the problem of testing the equivalence between two unknown univariate
distributions under the $\mathcal{A}_k$ distance metric:

Theorem 1 (Main). Given $\epsilon > 0$, an integer $k \ge 2$, and sample access to two distributions
with probability density functions $p, q : [0,1] \to \mathbb{R}_+$, there is a computationally efficient algorithm
which uses $O(\max\{k^{4/5}/\epsilon^{6/5}, k^{1/2}/\epsilon^2\})$ samples from $p, q$, and with probability at least 2/3
distinguishes whether $q = p$ versus $\|q - p\|_{\mathcal{A}_k} \ge \epsilon$. Additionally,
$\Omega(\max\{k^{4/5}/\epsilon^{6/5}, k^{1/2}/\epsilon^2\})$ samples are information-theoretically necessary for this task.

Note that Theorem 1 applies to arbitrary univariate distributions (over both continuous and
discrete domains). In particular, the sample complexity of the algorithm does not depend on the
support size of the underlying distributions. We believe that the notion of testing under the $\mathcal{A}_k$
distance is very natural, and well suited for (arbitrary) continuous distributions, where the notion
of $L_1$ testing is (provably) impossible.

As a corollary of Theorem 1 we obtain sample-optimal algorithms for the $L_1$ closeness testing
of various structured distribution families $\mathcal{C}$ in a unified way. The basic idea is to use the $\mathcal{A}_k$
distance as a “proxy” for the $L_1$ distance for an appropriate value of $k$ that depends on $\mathcal{C}$ and $\epsilon$.
We have the following simple fact:

Fact 2. For a univariate distribution family $\mathcal{C}$ and $\epsilon > 0$, let $k = k(\mathcal{C}, \epsilon)$ be the smallest
integer such that for any $f_1, f_2 \in \mathcal{C}$ it holds that $\|f_1 - f_2\|_1 \le \|f_1 - f_2\|_{\mathcal{A}_k} + \epsilon/2$. Then
there exists an $L_1$ closeness testing algorithm for $\mathcal{C}$ using $O(\max\{k^{4/5}/\epsilon^{6/5}, k^{1/2}/\epsilon^2\})$ samples.

Indeed, given sample access to $q, p \in \mathcal{C}$, we apply the $\mathcal{A}_k$-closeness testing algorithm of
Theorem 1 for the value of $k$ in the statement of the fact, and error $\epsilon' = \epsilon/2$. If $q = p$, the
algorithm will output “YES” with probability at least 2/3. If $\|q - p\|_1 \ge \epsilon$, then by the condition
of Fact 2 we have that $\|q - p\|_{\mathcal{A}_k} \ge \epsilon'$, and the algorithm will output “NO” with probability at
least 2/3.

We remark that the value of $k$ in Fact 2 is a natural complexity measure for the difference
between two probability density functions in the class $\mathcal{C}$. It follows from the definition of the
$\mathcal{A}_k$ distance that this value corresponds to the number of “essential” crossings between $f_1$ and
$f_2$, i.e., the number of crossings between the functions $f_1$ and $f_2$ that significantly affect their
$L_1$ distance. Intuitively, the number of essential crossings (as opposed to the domain size) is, in
some sense, the “right” parameter to characterize the sample complexity of $L_1$ closeness testing
for $\mathcal{C}$.
Distribution Family | Our upper bound | Previous work
$t$-piecewise constant | $O(\max\{t^{4/5}/\epsilon^{6/5}, t^{1/2}/\epsilon^2\})$ | [illegible] [CDSS14]
$t$-piecewise degree-$d$ | [illegible] | [illegible] [CDSS14]
log-concave | $O(1/\epsilon^{9/4})$ | [illegible] [CDSS14]
$k$-mixture of log-concave | [illegible] | [illegible] [CDSS14]
$t$-modal over $[n]$ | [illegible] | $O((t \log n)^{2/3}/\epsilon^{8/3} + t^2/\epsilon^4)$ [DDS+13]
MHR over $[n]$ | [illegible] | [illegible] [CDSS14]

Table 1: Algorithmic results for closeness testing of selected families of structured probability 
distributions. The second column indicates the sample complexity of our general algorithm applied 
to the class under consideration. The third column indicates the sample complexity of the best 
previously known algorithm for the same problem. 


The upper bound implied by the above fact is information-theoretically optimal for a wide
range of structured distribution classes $\mathcal{C}$. In particular, our bounds apply to all the structured
distribution families considered in [CDSS14, DKN15, ADLS15] including (arbitrary mixtures of)
$t$-flat (i.e., piecewise constant with $t$ pieces), $t$-piecewise degree-$d$ polynomials, $t$-monotone,
monotone hazard rate, and log-concave distributions. For $t$-flat distributions we obtain an $L_1$
closeness testing algorithm that uses $O(\max\{t^{4/5}/\epsilon^{6/5}, t^{1/2}/\epsilon^2\})$ samples, which is the first
$o(t)$ sample algorithm for the problem. For log-concave distributions, we obtain a sample size of
$O(\epsilon^{-9/4})$ matching the information-theoretic lower bound even for the case that one of the
distributions is explicitly given [DKN15]. Table 1 summarizes our upper bounds for a selection of
natural and well-studied distribution families. These results are obtained from Theorem 1 and
Fact 2 via the appropriate structural approximation results [CDSS13, CDSS14].

We would like to stress that our algorithm and its analysis are very different from previous
results in the property testing literature. We elaborate on this point in the following subsection.

2.3 Our Techniques. In this subsection, we provide a high-level overview of our techniques in
tandem with a comparison to prior work.

Our upper bound is achieved by an explicit, sample near-linear-time algorithm. A good starting
point for considering this problem would be the testing algorithm of [DKN15], which deals with
the case where $p$ is an explicitly known distribution. The basic idea of the testing algorithm in this
case [DKN15] is to partition the domain into intervals in several different ways, and run a known
$L_2$ tester on the reduced distributions (with respect to the intervals in the partition) as a black-box.
At a high level, these interval partitions can be constructed by exploiting our knowledge of $p$, in
order to divide our domain into several equal mass intervals under $p$. It can be shown that if $p$
and $q$ have large $\mathcal{A}_k$ distance from each other, one of these partitions will be able to detect the
difference.

Generalizing this algorithm to the case where $p$ is unknown turns out to be challenging, because
there seems to be no way to find the appropriate interval partitions with $o(k)$ samples. If we allowed
ourselves to take $\Omega(k/\epsilon)$ samples from $p$, we would be able to approximate an appropriate interval
partition, and make the aforementioned approach go through. Alas, this would not lead to an
$o(k)$ sample algorithm. If we can only draw $m$ samples from our distributions, the best that we
could hope to do would be to use our samples in order to partition the domain into $m + 1$ interval
regions. This, of course, is not going to be sufficient to allow an analysis along the lines of the
above approach to work. In particular, if we partition our domain deterministically into $m = o(k)$
intervals, it may well be the case that the reduced distributions over those intervals are identical,
despite the fact that the original distributions have large $\mathcal{A}_k$ distance. In essence, the differences
between $p$ and $q$ may well cancel each other out on the chosen intervals.

However, it is important to note that our interval boundaries are not deterministic. This
suggests that unless we get unlucky, the discrepancy between $p$ and $q$ will not actually cancel out in
our partition. As a slight modification of this idea, instead of partitioning the domain into intervals
(which we expect to have only $O(1)$ samples each) and comparing the number of samples from $p$
versus $q$ in each, we sort our samples and test how many of them came from the same distribution
as their neighbors (with respect to the natural ordering on the real line).

We intuitively expect that, if $p = q$, the number of pairs of adjacent samples drawn from the
same distribution and the number drawn from different distributions will be roughly the same.
Indeed, this can be formalized and the completeness of this tester is simple to establish. The
soundness analysis, however, is somewhat involved. We need to show that the expected value of the
statistic that we compute is larger than its standard deviation. While the variance is easy to bound
from above, bounding the expectation is quite challenging. To do so, we define a function, $f(t)$,
that encodes how likely it is that the samples near the point $t$ come from one distribution or the
other. It turns out that $f$ satisfies a relatively nice differential equation, and relates in a clean way
to the expectation of our statistic. From this, we can show that any discrepancy between $p$ and $q$
taking place on a scale too short to be detected by the above partitioning approach will yield a
notable contribution to our expectation.

The analysis of our lower bound begins by considering a natural class of testers, namely those 
that take some number of samples from p and q, sort the samples (while keeping track of which 
distribution they came from) and return an output that depends only on the ordering of these 
samples. For such testers we exhibit explicit families of pairs of distributions that are hard to 
distinguish from being identical. There is a particular pattern that appears many times in these 
examples, where there is a small interval for which q has an appropriate amount of probability mass, 
followed by an interval of p, followed by another interval of q. When the parameters are balanced 
correctly, it can be shown that when at most two samples are drawn from this subinterval, the 
distribution on their orderings is indistinguishable from the case where p = q. By constructing 
distributions with many copies of the pattern, we essentially show that a tester of this form will 
not be able to be confident that $p \neq q$, unless there are many of these small intervals from which
it draws three or more samples. On the other hand, a simple argument shows that this is unlikely 
to be the case. 

The above lower bound provides explicit distributions that are hard to distinguish from being 
identical by any tester in this limited class. To prove a lower bound against general testers, we 
proceed via a reduction: we show that an order-based tester can be derived from any general 
tester. It should be noted that this makes our lower bound in a sense non-constructive, as we do 
not know of any explicit families of distributions that are hard to distinguish from uniform for 
general testers. In order to perform this reduction, we show that for a general tester we can find 
some large subset S of its domain such that if all samples drawn from p and q by the tester happen 
to lie in S, then the output of the tester will depend only on the ordering of the samples. This 
essentially amounts to a standard result from Ramsey theory. Then, by taking any other problem, 
we can embed it into our new sample space by choosing new p and q that are the same up to an 
order-preserving rearrangement of the domain (which will also preserve $\mathcal{A}_k$ distance), ensuring that
they are supported only on S. 
3 Algorithm for $\mathcal{A}_k$ Closeness Testing

In this section we provide the sample optimal closeness tester under the $\mathcal{A}_k$ distance.

3.1 An $O(k^{4/5}/\epsilon^{6/5})$-sample tester. In this subsection we give a tester with sample complexity
$O(k^{4/5}/\epsilon^{6/5})$ that applies for $\epsilon = \Omega(k^{-1/6})$. For simplicity, we focus on the case that we take
samples from two unknown distributions with probability density functions $p, q : [0,1] \to \mathbb{R}_+$. Our
results are easily seen to extend to discrete probability distributions.

Algorithm Simple-Test-Identity-$\mathcal{A}_k$($p, q, \epsilon$)

Input: sample access to pdf’s $p, q : [0,1] \to \mathbb{R}_+$, $k \in \mathbb{Z}_+$, and $\epsilon > 0$.

Output: “YES” if $q = p$; “NO” if $\|q - p\|_{\mathcal{A}_k} \ge \epsilon$.

1. Let $m = C \cdot (k^{4/5}/\epsilon^{6/5})$, for a sufficiently large constant $C$. Draw two sets of samples
$S_p$, $S_q$ each of size $\mathrm{Poi}(m)$ from $p$ and from $q$ respectively.

2. Merge $S_p$ and $S_q$ while remembering from which distribution each sample comes.
Let $S$ be the union of $S_p$ and $S_q$ sorted in increasing order (breaking ties randomly).

3. Compute the statistic $Z$ defined as follows:

$Z$ = #(pairs of successive samples in $S$ coming from the same distribution) −
#(pairs of successive samples in $S$ coming from different distributions)

4. If $Z > 3 \cdot \sqrt{m}$ return “NO”. Otherwise return “YES”.
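The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a tuned implementation: the names `poisson`, `z_statistic`, and `simple_test_identity` are hypothetical, the samplers `draw_p` and `draw_q` are assumed to be callables that take a `random.Random` and return one sample, and the default `C = 10.0` is an arbitrary placeholder for the paper's unspecified "sufficiently large constant".

```python
import math
import random


def poisson(mean, rng):
    """Sample Poi(mean) by counting rate-1 exponential arrivals in [0, mean]."""
    count, t = 0, rng.expovariate(1.0)
    while t <= mean:
        count += 1
        t += rng.expovariate(1.0)
    return count


def z_statistic(samples_p, samples_q, rng):
    """#(adjacent pairs from the same source) - #(adjacent pairs from different sources)."""
    # Tag every point with a random tie-breaker and its source (0 = p, 1 = q).
    merged = [(x, rng.random(), 0) for x in samples_p]
    merged += [(x, rng.random(), 1) for x in samples_q]
    merged.sort()  # increasing order, ties broken by the random tag
    z = 0
    for (_, _, a), (_, _, b) in zip(merged, merged[1:]):
        z += 1 if a == b else -1
    return z


def simple_test_identity(draw_p, draw_q, k, eps, C=10.0, rng=None):
    """Steps 1-4 of the tester; C is an assumed placeholder constant."""
    rng = rng or random.Random()
    m = C * k ** 0.8 / eps ** 1.2                              # step 1
    samples_p = [draw_p(rng) for _ in range(poisson(m, rng))]
    samples_q = [draw_q(rng) for _ in range(poisson(m, rng))]
    z = z_statistic(samples_p, samples_q, rng)                 # steps 2-3
    return "NO" if z > 3 * math.sqrt(m) else "YES"             # step 4
```

As a sanity check, if $p$ and $q$ have disjoint supports (say uniform on $[0, 0.5]$ and on $[0.6, 1]$), all samples from $p$ precede all samples from $q$, so $Z$ is the total number of samples minus 3 and the tester rejects.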


Proposition 3. The algorithm Simple-Test-Identity-$\mathcal{A}_k$($p, q, \epsilon$), on input two samples each of
size $O(k^{4/5}/\epsilon^{6/5})$ drawn from two distributions with densities $p, q : [0,1] \to \mathbb{R}_+$, an integer
$k \ge 2$, and $\epsilon = \Omega(k^{-1/6})$, correctly distinguishes the case that $q = p$ from the case
$\|p - q\|_{\mathcal{A}_k} \ge \epsilon$, with probability at least 2/3.

Proof. First, it is straightforward to verify the claimed sample complexity, since the algorithm only 
draws samples in Step 1. To simplify the analysis we make essential use of the following simple 
claim: 

Claim 4. We can assume without loss of generality that the pdf’s $p, q : [0,1] \to \mathbb{R}_+$ are continuous
functions bounded from above by 2.

Proof. We start by showing we can assume that $p, q$ are at most 2. Let $p, q : [0,1] \to \mathbb{R}_+$ be
arbitrary pdf’s. We consider the cumulative distribution function (CDF) $\Phi$ of the mixture
$(p + q)/2$. Let $X \sim p$, $Y \sim q$, $W \sim (p + q)/2$ be random variables. Since $\Phi$ is non-decreasing,
replacing $X$ and $Y$ by $\Phi(X)$ and $\Phi(Y)$ does not affect the algorithm (as the ordering on the
samples remains the same). We claim that, after making this replacement, $\Phi(X)$ and $\Phi(Y)$ are
continuous distributions with probability density functions bounded by 2. In fact, we will show that
the sum of their probability density functions is exactly 2. This is because for any
$0 \le a \le b \le 1$,

$$\Pr[\Phi(X) \in [a, b]] + \Pr[\Phi(Y) \in [a, b]] = 2 \Pr[\Phi(W) \in [a, b]] = 2(b - a) ,$$

where the second equality is by the definition of a CDF. Thus, we can assume that $p$ and $q$ are
bounded from above by 2.
To show that we can assume continuity, note that $p$ and $q$ can be approximated by continuous
density functions $p'$ and $q'$ so that the $L_1$ errors $\|p - p'\|_1$, $\|q - q'\|_1$ are each at most
$1/(10m)$. If our algorithm succeeds with the continuous densities $p'$ and $q'$, it must also succeed
for $p$ and $q$. Indeed, since the $L_1$ distance between $p$ and $p'$ and between $q$ and $q'$ is at most
$1/(10m)$, a set of $m$ samples taken from $p$ or $q$ is statistically indistinguishable from $m$ samples
taken from $p'$ or $q'$. This proves that it is no loss of generality to assume that $p$ and $q$ are
continuous. □

Note that the algorithm makes use of the well-known “Poissonization” approach. Namely,
instead of drawing $m = O(k^{4/5}/\epsilon^{6/5})$ samples from $p$ and from $q$, we draw $m' = \mathrm{Poi}(m)$ samples
from $p$ and $m'' = \mathrm{Poi}(m)$ samples from $q$. The crucial properties of the Poisson distribution are that
it is sharply concentrated around its mean and it makes the number of times different elements
occur in the sample independent.

We now establish completeness. Note that our algorithm draws $\mathrm{Poi}(2m)$ samples in total from $p$
and $q$. If $p = q$, then our process equivalently selects $\mathrm{Poi}(2m)$ values from $p$ and then randomly
and independently with equal probability decides whether each sample came from $p$ or from $q$.
Making these decisions one at a time in increasing order of points, we note that each adjacent pair
of elements in $S$ randomly and independently contributes either a $+1$ or a $-1$ to $Z$. Therefore,
the distribution of $Z$ is exactly that of a sum of $\mathrm{Poi}(2m) - 1$ independent $\{\pm 1\}$ random variables.
Therefore, $Z$ has mean 0 and variance $2m - 1$. By Chebyshev’s inequality it follows that
$|Z| \le 3\sqrt{m}$ with probability at least 7/9. This proves completeness.
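For reference, the Chebyshev step can be written out using only the mean and variance stated above:

```latex
\Pr\big[\,|Z| > 3\sqrt{m}\,\big]
  \;\le\; \frac{\mathrm{Var}[Z]}{(3\sqrt{m})^{2}}
  \;=\; \frac{2m - 1}{9m}
  \;<\; \frac{2}{9},
\qquad\text{so}\qquad
\Pr\big[\,|Z| \le 3\sqrt{m}\,\big] \;>\; \frac{7}{9}.
```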

We now proceed to prove the soundness of our algorithm. Assuming that $\|p - q\|_{\mathcal{A}_k} \ge \epsilon$ we
want to show that the value of $Z$ is at most $3\sqrt{m}$ with probability at most 1/3. To prove this
statement, we will again use Chebyshev’s inequality. In this case it suffices to show that
$E[Z] \gg \sqrt{\mathrm{Var}[Z]} + \sqrt{m}$ for the inequality to be applicable. We begin with an important
definition.

Definition 5. Let $f : [0,1] \to [-1,1]$ equal

$$f(t) = \Pr[\text{largest sample in } S \text{ that is at most } t \text{ was drawn from } p]
 - \Pr[\text{largest sample in } S \text{ that is at most } t \text{ was drawn from } q] .$$

The importance of this function is demonstrated by the following lemma. 

Lemma 6. We have that: $E[Z] = m \int_0^1 f(t)(p(t) - q(t))\,dt$.

Proof. Given an interval $I$, we let $Z_I$ be the contribution to $Z$ coming from pairs of consecutive
points of $S$ the larger of which is drawn from $I$. We wish to approximate the expectation of $Z_I$.
We let $\tau(I) = m(p(I) + q(I))$ be the expected total number of points drawn from $I$. We note that
the contribution coming from cases where more than one point is drawn from $I$ is $O(\tau(I)^2)$. We
next consider the contribution under the condition that only one sample is drawn from $I$. For this,
we let $P_I$ and $Q_I$ be the events that the largest element of $S$ preceding $I$ comes from $p$ or $q$
respectively. We have that the expected contribution to $Z_I$ coming from events where exactly one
element of $S$ is drawn from $I$ is:

$$(\Pr[P_I] - \Pr[Q_I]) \Pr(\text{The only element drawn from } I \text{ is from } p)
 - (\Pr[P_I] - \Pr[Q_I]) \Pr(\text{The only element drawn from } I \text{ is from } q).$$

Letting $x_I$ be the left endpoint of $I$, this is

$$f(x_I)(m p(I) - m q(I)) + O(\tau(I)^2).$$

Therefore,

$$E[Z_I] = f(x_I)(m p(I) - m q(I)) + O(\tau(I)^2).$$

Letting $\mathcal{I}$ be a partition of our domain into intervals, we find that

$$E[Z] = \sum_{I \in \mathcal{I}} E[Z_I]
 = \sum_{I \in \mathcal{I}} f(x_I)(m p(I) - m q(I)) + \sum_{I \in \mathcal{I}} O(\tau(I)^2)
 = O\big(m \max_{I \in \mathcal{I}} \tau(I)\big) + \sum_{I \in \mathcal{I}} f(x_I)(m p(I) - m q(I)).$$

As the partition $\mathcal{I}$ becomes iteratively more refined, these sums approach Riemann sums for the
integral of

$$m f(x)(p(x) - q(x))\,dx.$$

Therefore, taking a limit over partitions $\mathcal{I}$, we have that

$$E[Z] = m \int_0^1 f(x)(p(x) - q(x))\,dx.$$

□

We will also make essential use of the following technical lemma: 

Lemma 7. The function $f$ is differentiable with derivative
$f'(t) = m\,(p(t) - q(t) - (p(t) + q(t)) f(t))$.

Proof. Consider the difference between $f(t)$ and $f(t + h)$ for some small $h > 0$. We note that
$f(t) = E[F_t]$ where $F_t$ is 1 if the sample of $S$ preceding $t$ came from $p$, $-1$ if the sample came
from $q$, and 0 if no sample came before $t$. Note that

$$F_{t+h} = \begin{cases}
F_t & \text{if no samples from } p \text{ nor } q \text{ are drawn from } [t, t+h] \\
1 & \text{if one sample from } p \text{ and none from } q \text{ are drawn from } [t, t+h] \\
-1 & \text{if one sample from } q \text{ and none from } p \text{ are drawn from } [t, t+h] \\
\pm 1 & \text{if at least two samples from } p \text{ or } q \text{ are drawn from } [t, t+h].
\end{cases}$$

Since $p$ and $q$ are continuous at $t \in [0,1]$, these four events happen with probabilities
$1 - mh(p(t) + q(t)) + o(h)$, $mhp(t) + o(h)$, $mhq(t) + o(h)$, $o(h)$, respectively. Therefore, taking an
expectation we find that $f(t + h) = f(t)(1 - mh(p(t) + q(t))) + mh(p(t) - q(t)) + o(h)$. This, and a
similar relation relating $f(t)$ to $f(t - h)$, proves that $f$ is differentiable with the desired
derivative. □

To analyze the desired expectation, $E[Z]$, we consider the quantity
$\int_0^1 f'(t) f(t)\,dt = (1/2)(f^2(1) - f^2(0))$. Substituting $f'$ from Lemma 7 above gives

$$\int_0^1 f'(t) f(t)\,dt = m \int_0^1 f(t)(p(t) - q(t))\,dt - m \int_0^1 f^2(t)(p(t) + q(t))\,dt.$$

Combining this with Lemma 6, and using $f(0) = 0$ (no sample precedes the point 0), we get

$$E[Z] = m \int_0^1 f^2(t)(p(t) + q(t))\,dt + f^2(1)/2 . \qquad (1)$$
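The differential equation of Lemma 7 and identity (1) can be checked numerically on a concrete pair of densities. The sketch below is only a sanity check under assumptions of my choosing: the densities $p(t) = 1 + a\cos(2\pi j t)$ and $q(t) = 1 - a\cos(2\pi j t)$, the parameter values, and the function name `check_identity` are all illustrative. It integrates $f' = m(p - q - (p+q)f)$ with $f(0) = 0$ by a standard fourth-order Runge-Kutta scheme and compares the expression for $E[Z]$ from Lemma 6 with the right-hand side of (1).

```python
import math


def check_identity(m=50.0, a=0.5, j=3, steps=10_000):
    """Return (Lemma 6 value, right-hand side of (1)) for a toy pair p, q."""
    p = lambda t: 1.0 + a * math.cos(2 * math.pi * j * t)  # density on [0, 1]
    q = lambda t: 1.0 - a * math.cos(2 * math.pi * j * t)  # density on [0, 1]
    rhs = lambda t, f: m * (p(t) - q(t) - (p(t) + q(t)) * f)  # Lemma 7

    h = 1.0 / steps
    ts = [i * h for i in range(steps + 1)]
    fs = [0.0]  # f(0) = 0: no sample precedes 0
    for t in ts[:-1]:
        f = fs[-1]
        k1 = rhs(t, f)
        k2 = rhs(t + h / 2, f + h * k1 / 2)
        k3 = rhs(t + h / 2, f + h * k2 / 2)
        k4 = rhs(t + h, f + h * k3)
        fs.append(f + h * (k1 + 2 * k2 + 2 * k3 + k4) / 6)

    def trapezoid(values):
        # composite trapezoid rule on the uniform grid ts
        return h * (sum(values) - 0.5 * (values[0] + values[-1]))

    lemma6 = m * trapezoid([f * (p(t) - q(t)) for t, f in zip(ts, fs)])
    eq1 = m * trapezoid([f * f * (p(t) + q(t)) for t, f in zip(ts, fs)])
    eq1 += fs[-1] ** 2 / 2
    return lemma6, eq1
```

Both returned values should agree up to discretization error, and both are positive because the right-hand side of (1) is a sum of nonnegative terms.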



The second term in (1) above is $O(1)$, so we focus our attention on bounding the first term from
below. To do this, we consider intervals $I \subseteq [0,1]$ over which $|p(I) - q(I)|$ is “large” and show
that they must produce some noticeable contribution to the first term. Fix such an interval $I$. We
want to show that $f^2$ is large somewhere in $I$. Intuitively, we attempt to prove that on at least one
of the endpoints of the interval, the value of $f$ is big. Since $f$ does not vary too rapidly, $f^2$ will be
large on some large fraction of $I$. Formally, we have the following lemma:

Lemma 8. For $\delta > 0$, let $I \subseteq [0,1]$ be an interval with $|p(I) - q(I)| = \delta$ and
$p(I) + q(I) < 1/m$. Then, there exists an $x \in I$ such that $|f(x)| \ge m\delta/3$.

Proof. Suppose for the sake of contradiction that $|f(x)| < m\delta/3$ for all $x \in I = [X, Y]$. Then, we
have that

$$2m\delta/3 > |f(X) - f(Y)| = \left| \int_X^Y f'(t)\,dt \right|
 = \left| \int_X^Y \big( m(p(t) - q(t)) - m f(t)(p(t) + q(t)) \big)\,dt \right|$$
$$= \left| m(p(I) - q(I)) - m \int_X^Y f(t)(p(t) + q(t))\,dt \right|
 \ge m|p(I) - q(I)| - m \left| \int_X^Y f(t)(p(t) + q(t))\,dt \right|$$
$$> m\delta - m \int_X^Y (m\delta/3)(p(t) + q(t))\,dt = m\delta\,(1 - m(p(I) + q(I))/3) > 2m\delta/3 ,$$

which yields the desired contradiction. □


We are now able to show that the contribution to $E[Z]$ coming from such an interval is large.

Lemma 9. Let $I$ be an interval satisfying the hypotheses of Lemma 8. Then

$$\int_I f^2(t)(p(t) + q(t))\,dt = \Omega(m^2 \delta^3) .$$

Proof. By Lemma 8, $f$ is large at some point $x$ of the interval $I = [X, Y]$. Without loss of
generality, we assume that $p([X, x]) + q([X, x]) \le (p(I) + q(I))/2$. Let $I' = [x, Y']$ be the interval
so that $p(I') + q(I') = \delta/9$. Note that $I' \subseteq I$ (since by assumption $|p(I) - q(I)| \ge \delta$ and thus
$p(I) + q(I) \ge \delta$). Furthermore, note that since with probability at least $1 - m\delta/9$, no samples
from $S$ lie in $I'$, we have that for all $z$ in $I'$ it holds $|f(x) - f(z)| \le 2m\delta/9$, so
$|f(z)| \ge m\delta/9$. Therefore,

$$\int_I f^2(t)(p(t) + q(t))\,dt \ge \int_{I'} f^2(t)(p(t) + q(t))\,dt
 \ge \left( \frac{m\delta}{9} \right)^2 \int_{I'} (p(t) + q(t))\,dt
 = \frac{m^2 \delta^2}{81} \cdot \frac{\delta}{9} = \frac{m^2 \delta^3}{729} .$$

□


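Since $\|p - q\|_{\mathcal{A}_k} > \varepsilon$, there is a partition $\mathcal{I}$ of $[0,1]$ into $k$ intervals so that $\|p_r^{\mathcal{I}} - q_r^{\mathcal{I}}\|_1 > \varepsilon$.
By subdividing intervals further if necessary, we can guarantee that $\mathcal{I}$ has at most $3k$ intervals,
$\|p_r^{\mathcal{I}} - q_r^{\mathcal{I}}\|_1 > \varepsilon$, and for each subinterval $I \in \mathcal{I}$ it holds $p(I), q(I) \leq 1/k$. For each such interval
$I \in \mathcal{I}$, let $\delta_I = |p(I) - q(I)|$. Note that $\sum_{I \in \mathcal{I}} \delta_I \geq \varepsilon$.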
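For discrete distributions, the quantity being maximized over interval partitions can be computed directly. The following is a minimal sketch, assuming the $\mathcal{A}_k$ distance is taken as the maximum of $\sum_I |p(I) - q(I)|$ over partitions of the domain into at most $k$ intervals (no extra normalization factor); the function name `ak_distance` and the dynamic program are illustrative and not part of the paper.

```python
from typing import Sequence

def ak_distance(p: Sequence[float], q: Sequence[float], k: int) -> float:
    """Max over partitions of [n] into at most k intervals of sum_I |p(I) - q(I)|.

    Assumption: this is the un-normalized A_k distance; the dynamic program
    below is an illustrative O(k n^2) method, not an algorithm from the paper.
    """
    n = len(p)
    assert len(q) == n and k >= 1
    # prefix[i] = sum over the first i points of (p - q).
    prefix = [0.0]
    for a, b in zip(p, q):
        prefix.append(prefix[-1] + a - b)
    neg_inf = float("-inf")
    # best[i] = best value when the first i points are split into exactly j intervals.
    best = [0.0] + [neg_inf] * n  # j = 0: only the empty prefix is feasible
    answer = 0.0
    for _ in range(min(k, n)):
        new = [neg_inf] * (n + 1)
        for i in range(1, n + 1):
            for l in range(i):
                if best[l] > neg_inf:
                    new[i] = max(new[i], best[l] + abs(prefix[i] - prefix[l]))
        best = new
        answer = max(answer, best[n])
    return answer
```

With $k = n$ the value coincides with the $L_1$ distance, and with $k = 1$ it is $|p([n]) - q([n])| = 0$ for two probability distributions.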

By (1) we have that

\begin{align*}
\mathbf{E}[Z] &= m \int_0^1 f^2(t)(p(t) + q(t))\,dt + O(1) \\
&\geq \sum_{I \in \mathcal{I}} \Omega(m^3 \delta_I^3) = \Omega\Big( m^3 \big( \textstyle\sum_{I \in \mathcal{I}} \delta_I \big)^3 / (3k)^2 \Big) \\
&= \Omega(m^3 \varepsilon^3 / k^2) = \Omega(C^{5/2} \sqrt{m}).
\end{align*}

We note that the second to last line above follows by Hölder’s inequality. It remains to bound from
above the variance of Z.
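The Hölder step and the final substitution in the bound on $\mathbf{E}[Z]$ can be spelled out as follows. This is a worked restatement under the assumptions already in force (at most $3k$ intervals, $\sum_I \delta_I \geq \varepsilon$, and $m = Ck^{4/5}/\varepsilon^{6/5}$); it introduces no new claims.

```latex
% Hoelder with exponents 3 and 3/2, applied to the at most 3k values \delta_I:
\sum_{I \in \mathcal{I}} \delta_I \cdot 1
  \;\le\; \Big( \sum_{I \in \mathcal{I}} \delta_I^3 \Big)^{1/3} \Big( \sum_{I \in \mathcal{I}} 1 \Big)^{2/3}
  \;\le\; \Big( \sum_{I \in \mathcal{I}} \delta_I^3 \Big)^{1/3} (3k)^{2/3},
\qquad\text{hence}\qquad
\sum_{I \in \mathcal{I}} \delta_I^3 \;\ge\; \frac{\big( \sum_{I} \delta_I \big)^3}{(3k)^2} \;\ge\; \frac{\varepsilon^3}{9k^2}.

% Substituting m = C k^{4/5} / \varepsilon^{6/5}, i.e. m^{5/2} = C^{5/2} k^2 / \varepsilon^3:
\frac{m^3 \varepsilon^3}{k^2} \;=\; \sqrt{m} \cdot \frac{m^{5/2} \varepsilon^3}{k^2} \;=\; C^{5/2} \sqrt{m}.
```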

Lemma 10. We have that $\mathrm{Var}[Z] = O(m)$.

Proof. We divide the domain $[0,1]$ into $m$ intervals $I_i$, $i = 1, \ldots, m$, each of total mass $2/m$ under
the sum-distribution $p + q$. Consider the random variable $X_i$ denoting the contribution to $Z$
coming from pairs of adjacent samples in $S$ such that the right sample is drawn from $I_i$. Clearly,
$Z = \sum_{i=1}^m X_i$ and $\mathrm{Var}[Z] = \sum_{i=1}^m \mathrm{Var}[X_i] + \sum_{i \neq j} \mathrm{Cov}(X_i, X_j)$.

To bound the first sum, note that the number of pairs of $S$ in an interval $I_i$ is no more than
the number of samples drawn from $I_i$, and the variance of $X_i$ is at most the expectation of the
square of the number of samples from $I_i$. Since the number of samples from $I_i$ is a Poisson random
variable with parameter 2, we have $\mathrm{Var}[X_i] = O(1)$. This shows that $\sum_{i=1}^m \mathrm{Var}[X_i] = O(m)$.

To bound the sum of covariances, consider $X_i$ and $X_j$ conditioned on the samples drawn from
intervals other than $I_i$ and $I_j$. Note that if any sample is drawn from an intermediate interval,
$X_i$ and $X_j$ are uncorrelated, and otherwise their covariance is at most $\sqrt{\mathrm{Var}(X_i)\mathrm{Var}(X_j)} = O(1)$.
Since the probability that no sample is drawn from any intervening interval decreases exponentially
with their separation, it follows that $\mathrm{Cov}(X_i, X_j) = O(1) \cdot e^{-\Omega(|i-j|)}$. Summing over all pairs $i \neq j$ gives a total of $O(m)$. This completes the proof. □

An application of Chebyshev’s inequality completes the analysis of the soundness and the proof
of Proposition 3. □

3.2 The General Tester

In this section, we present a tester whose sample complexity is optimal
(up to constant factors) for all values of $\varepsilon$ and $k$, thereby establishing the upper bound part of
Theorem 1. Our general tester (Algorithm Test-Identity-$\mathcal{A}_k$) builds on the tester presented in
the previous subsection (Algorithm Simple-Test-Identity-$\mathcal{A}_k$). It is not difficult to see that the
latter algorithm can fail once $\varepsilon$ becomes sufficiently small, if the discrepancy between $p$ and $q$
is concentrated on intervals of mass larger than $1/m$. In this scenario, the tester Simple-Test-Identity-$\mathcal{A}_k$
will not take sufficient advantage of these intervals. To obtain our enhanced tester
Test-Identity-$\mathcal{A}_k$, we will need to combine Simple-Test-Identity-$\mathcal{A}_k$ with an alternative tester when
this is the case. Note that we can easily bin the distributions $p$ and $q$ into intervals of total mass
approximately $1/m$ by taking $m$ random samples. Once we do this, we can use an identity tester
similar to that in our previous work [DKN15] to detect the discrepancy in these intervals. In
particular we show the following:

Proposition 11. Let $p, q$ be discrete distributions over $[n]$ satisfying $\|p\|_2, \|q\|_2 = O(1/\sqrt{n})$. There
exists a testing algorithm with the following properties: On input $k \in \mathbb{Z}_+$, $2 \leq k \leq n$, and $\delta, \varepsilon > 0$,
the algorithm draws $O\big( (\sqrt{k}/\varepsilon^2) \cdot \log(1/\delta) \big)$ samples from $p$ and $q$ and with probability at least $1 - \delta$
distinguishes between the cases $p = q$ and $\|p - q\|_{\mathcal{A}_k} > \varepsilon$.

The above proposition says that the identity testing problem under the $\mathcal{A}_k$ distance can be
solved with $O(\sqrt{k}/\varepsilon^2)$ samples when both distributions $p$ and $q$ are promised to be “nearly” uniform
(in the sense that their $L_2$ norm is $O(1)$ times that of the uniform distribution). To prove
Proposition 11 we follow a similar approach as in [DKN15]: Starting from the $L_2$ identity tester
of [CDVV14], we consider several oblivious interval decompositions of the domain into intervals of
approximately the same mass, and apply a “reduced” identity tester for each such decomposition.
The details of the analysis establishing Proposition 11 are postponed to Appendix A.
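The binning idea mentioned above (using sample points as interval endpoints and then working with the induced reduced distributions) can be sketched as follows. This is an illustration only: the helper name `reduce_to_bins` is hypothetical, and the convention that an endpoint belongs to the bin on its right is an assumption made here for definiteness.

```python
from bisect import bisect_right
from collections import Counter

def reduce_to_bins(samples, endpoints):
    """Count how many samples fall in each interval cut out by `endpoints`.

    With c cut points there are c + 1 bins; bin i collects the samples lying
    between the (i-1)-th and i-th cut point (a cut point itself goes right).
    """
    cuts = sorted(endpoints)
    counts = Counter(bisect_right(cuts, x) for x in samples)
    return [counts.get(i, 0) for i in range(len(cuts) + 1)]
```

Feeding the samples of $p$ and of $q$ through the same set of cut points yields samples from the two reduced distributions over the same discrete domain, which is the form of input a discrete tester expects.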

We are now ready to present our general testing algorithm:

Algorithm Test-Identity-$\mathcal{A}_k(p, q, \varepsilon)$

Input: sample access to distributions $p, q : [0,1] \to \mathbb{R}_+$, $k \in \mathbb{Z}_+$, and $\varepsilon > 0$.

Output: “YES” if $q = p$; “NO” if $\|q - p\|_{\mathcal{A}_k} > \varepsilon$.

1. Let $m = Ck^{4/5}/\varepsilon^{6/5}$, for a sufficiently large constant $C$. Draw two sets of samples $S_p$,
$S_q$ each of size $\mathrm{Poi}(m)$ from $p$ and from $q$ respectively.

2. Merge $S_p$ and $S_q$ while remembering which distribution each sample comes from.
Let $S$ be the union of $S_p$ and $S_q$ sorted in increasing order (breaking ties randomly).

3. Compute the statistic $Z$ defined as follows:

Z = #(pairs of successive samples in S coming from the same distribution) −
#(pairs of successive samples in S coming from different distributions)

4. If $Z > 5\sqrt{m}$ return “NO”.

5. Repeat the following steps $O(C)$ times:

(a) Draw $\mathrm{Poi}(m)$ samples from $(p + q)/2$.

(b) Split the domain into intervals with the interval endpoints given by the above
samples. Let $p'$ and $q'$ be the reduced distributions with respect to these intervals.

(c) Run the tester of Proposition 11 on $p'$ and $q'$ with error probability $1/C^2$ to
determine if $\|p' - q'\|_{\mathcal{A}_{2k+1}} > \varepsilon/C$. If the output of this tester is “NO”, output
“NO”.

6. Output “YES”.
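Steps 2 and 3 of the algorithm above can be sketched in a few lines. This is a minimal illustration, assuming the two sample sets are given as lists of real numbers; the function name `successive_pair_statistic` and the use of a random tie-breaking key are choices made here, not taken from the paper.

```python
import random

def successive_pair_statistic(samples_p, samples_q, rng=None):
    """Z = #(adjacent pairs from the same source) - #(adjacent pairs from different sources)."""
    rng = rng or random.Random(0)
    # Tag each sample with a random tie-breaking key and its source label.
    tagged = [(x, rng.random(), "p") for x in samples_p]
    tagged += [(x, rng.random(), "q") for x in samples_q]
    tagged.sort()  # increasing order, ties broken by the random key
    labels = [source for _, _, source in tagged]
    same = sum(a == b for a, b in zip(labels, labels[1:]))
    different = max(len(labels) - 1, 0) - same
    return same - different
```

When the two sources are well separated on the line, long runs of equal labels make $Z$ large; when $p = q$ the labels of successive samples are independent fair coins and $Z$ has mean 0.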


Our main result for this section is the following: 

Theorem 12. Algorithm Test-Identity-$\mathcal{A}_k$ draws $O(\max\{k^{4/5}/\varepsilon^{6/5}, k^{1/2}/\varepsilon^2\})$ samples from $p, q$ and
with probability at least 2/3 returns “YES” if $p = q$ and “NO” if $\|p - q\|_{\mathcal{A}_k} > \varepsilon$.

Proof. First, it is easy to see that the sample complexity of the algorithm is $O(m + k^{1/2}/\varepsilon^2)$. Recall
that we can assume that $p, q$ are continuous pdf’s bounded from above by 2.

We start by establishing completeness. If $p = q$, it is once again the case that $\mathbf{E}[Z] = 0$ and
$\mathrm{Var}[Z] < 2m$, so by Chebyshev’s inequality, Step 4 will fail with probability at most 1/9. Next,
when taking our samples in Step 5(a), note that the expected sample size is $O(m)$ and that the
expected squared $L_2$ norms of the reduced distributions $p'$ and $q'$ are $O(1/m)$. Therefore, with
probability at least $1 - 1/C^2$, $p'$ and $q'$ satisfy the hypothesis of Proposition 11. Hence, this holds
for all $C$ iterations with probability at least 8/9.

Conditioning on this event, since $p' = q'$, the tester in Step 5(c) will return “YES” with
probability at least $1 - 1/C^2$ on each iteration. Therefore, it returns “YES” on all iterations with
probability at least 8/9. By a union bound, it follows that if $p = q$, our algorithm returns “YES”
with probability at least 2/3.

We now proceed to establish soundness. Suppose that $\|p - q\|_{\mathcal{A}_k} \geq \varepsilon$. Then there exists a
partition $\mathcal{I}$ of the domain into $k$ intervals such that $\|p_r^{\mathcal{I}} - q_r^{\mathcal{I}}\|_1 \geq \varepsilon$. For an interval $I \in \mathcal{I}$, let
$\delta(I) = |p(I) - q(I)|$. We will call an $I \in \mathcal{I}$ small if there is a subinterval $J \subseteq I$ so that $p(J) + q(J) < 1/m$
and $|p(J) - q(J)| \geq \delta(I)/3$. We will call $I$ large otherwise. Note that
$\sum_{I \in \mathcal{I}, I \text{ small}} \delta(I) + \sum_{I \in \mathcal{I}, I \text{ large}} \delta(I) = \sum_{I \in \mathcal{I}} \delta(I) \geq \varepsilon$.
Therefore either $\sum_{I \in \mathcal{I}, I \text{ small}} \delta(I) \geq \varepsilon/2$ or $\sum_{I \in \mathcal{I}, I \text{ large}} \delta(I) \geq \varepsilon/2$.
We analyze soundness separately in each of these cases.

Consider first the case that $\sum_{I \in \mathcal{I}, I \text{ small}} \delta(I) \geq \varepsilon/2$. The analysis in this case is very similar to
the soundness proof of Proposition 3, which we describe for the sake of completeness.

By definition, for each small interval $I$, there exists a subinterval $J$ so that $p(J) + q(J) < 1/m$ and
$|p(J) - q(J)| \geq \delta(I)/3$. By Lemma 9, for such $J$ we have that $\int_J f^2(t)(p(t) + q(t))\,dt = \Omega(m^2 \delta^3(I))$,
and therefore, that $\int_I f^2(t)(p(t) + q(t))\,dt = \Omega(m^2 \delta^3(I))$. Hence, we have that


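\begin{align*}
\mathbf{E}[Z] &\geq m \int_0^1 f^2(t)(p(t) + q(t))\,dt \\
&\geq \sum_{I \in \mathcal{I},\, I \text{ small}} m \int_I f^2(t)(p(t) + q(t))\,dt \\
&= \sum_{I \in \mathcal{I},\, I \text{ small}} \Omega(m^3 \delta^3(I)) \\
&\geq \Omega(m^3) \Big( \sum_{I \in \mathcal{I},\, I \text{ small}} \delta(I) \Big)^3 / k^2 \\
&= \Omega(m^3 \varepsilon^3 / k^2) \\
&= \Omega(C^{5/2} \sqrt{m}).
\end{align*}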


On the other hand, Lemma 10 gives that $\mathrm{Var}[Z] = O(m)$, so for $C$ sufficiently large, Chebyshev’s
inequality implies that with probability at least 2/3 it holds $Z > 5\sqrt{m}$. That is, our algorithm
outputs “NO” with probability at least 2/3.

Now consider the case that $\sum_{I \in \mathcal{I}, I \text{ large}} \delta(I) \geq \varepsilon/2$. We claim that the second part of our tester
will detect the discrepancy between $p$ and $q$ with high constant probability. Once again, we can say
that with probability at least 8/9 the squared $L_2$ norms of the reduced distributions $p'$ and $q'$ are
both $O(1/m)$ and that the size of the reduced domain is $O(m)$. Thus, the conditions of Proposition
11 are satisfied on all iterations with probability at least 8/9. To complete the proof, we will show
that with constant probability we have $\|p' - q'\|_{\mathcal{A}_{2k+1}} > \varepsilon/C$. To do this, we construct an explicit
partition $\mathcal{I}'$ of our reduced domain into at most $2k + 1$ intervals so that with constant probability
$\|p_r^{\mathcal{I}'} - q_r^{\mathcal{I}'}\|_1 > \varepsilon/C$. This will imply that with probability at least 8/9, on at least one of our
$C$ trials we have $\|p_r^{\mathcal{I}'} - q_r^{\mathcal{I}'}\|_1 > \varepsilon/C$.

More specifically, for each interval $I \in \mathcal{I}$ we place interval boundaries at the smallest and largest
sample points taken from $I$ in Step 5(a) (ignoring them if fewer than two samples landed in $I$).
Since we have selected at most $2k$ points, this process defines a partition $\mathcal{I}'$ of the domain into at
most $2k + 1$ intervals. We will show that the reduced distributions $p'' = p_r^{\mathcal{I}'}$ and $q'' = q_r^{\mathcal{I}'}$ have large
expected $L_1$ error.

In particular, for each interval $I \in \mathcal{I}$ let $I'$ be the interval between the first and last sample
points of $I$. Note that $I'$ is an interval in the partition $\mathcal{I}'$. We claim that if $I$ is large, then with
constant probability

\[ |p(I') - q(I')| = \Omega(\delta(I)). \]

Let $I = [X, Y]$ and $I' = [x, y]$ (so $x$ and $y$ are the smallest and largest samples taken from $I$,
respectively). We note that if $p([X, x]) + q([X, x]) < 1/m$ and $p([y, Y]) + q([y, Y]) < 1/m$ then

\[ |p(I') - q(I')| \geq |p(I) - q(I)| - |p([X,x]) - q([X,x])| - |p([y,Y]) - q([y,Y])| \geq \delta(I) - \delta(I)/3 - \delta(I)/3 = \delta(I)/3, \]

where the second inequality uses the fact that $I$ is large. On the other hand, we note that $p([X, x]) + q([X, x])$
and $p([y, Y]) + q([y, Y])$ are exponential distributions with mean $1/m$, and thus, this event
happens with constant probability. Let $N_I$ be the indicator random variable for the event that
$|p(I') - q(I')| \geq \delta(I)/3$. We have that

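\[ \|p'' - q''\|_1 \geq \sum_{I \in \mathcal{I}} N_I\, \delta(I)/3 \geq \sum_{I \in \mathcal{I},\, I \text{ large}} N_I\, \delta(I)/3. \]

Thus, we have that

\[ \|p'' - q''\|_1 \geq \sum_{I \in \mathcal{I},\, I \text{ large}} \delta(I)/3 - \sum_{I \in \mathcal{I},\, I \text{ large}} (1 - N_I)\, \delta(I)/3. \]

Therefore, since

\[ \mathbf{E}\Big[ \sum_{I \in \mathcal{I},\, I \text{ large}} (1 - N_I)\, \delta(I)/3 \Big] \leq \Big( \sum_{I \in \mathcal{I},\, I \text{ large}} \delta(I)/3 \Big) (1 - c) \]

for some fixed $c > 0$, we have with constant probability that

\[ \|p'' - q''\|_1 \geq c \sum_{I \in \mathcal{I},\, I \text{ large}} \delta(I)/3 \geq c\varepsilon/6 \geq \varepsilon/C. \]

This means that with probability at least 8/9 for at least one iteration we will have that
$\|p' - q'\|_{\mathcal{A}_{2k+1}} > \varepsilon/C$, and therefore, with probability at least 2/3, our algorithm outputs “NO”.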

□ 

4 Lower Bound for Ak Closeness Testing 

Our upper bound from Section 3 seems potentially suboptimal. Instead of obtaining an upper
bound of $O(\max\{k^{2/3}/\varepsilon^{4/3}, k^{1/2}/\varepsilon^2\})$, which would be analogous to the unstructured testing result
of [CDVV14], we obtain a very different bound of $O(\max\{k^{4/5}/\varepsilon^{6/5}, k^{1/2}/\varepsilon^2\})$. In this section
we show, surprisingly, that our upper bound is optimal for continuous distributions, or discrete
distributions with support size $n$ that is sufficiently large as a function of $k$.
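To get a feel for which of the two terms in the bound dominates, one can evaluate them numerically. The sketch below ignores all constant factors (so the returned numbers are orders of magnitude, not sample counts), and the function name `closeness_sample_bound` is hypothetical. Comparing the two terms shows that the first dominates exactly when $\varepsilon \geq k^{-3/8}$.

```python
def closeness_sample_bound(k, eps):
    """Return (max term, first term, second term) of max{k^(4/5)/eps^(6/5), k^(1/2)/eps^2}.

    Constant factors hidden by the O-notation are ignored.
    """
    first = k ** 0.8 / eps ** 1.2
    second = k ** 0.5 / eps ** 2
    return max(first, second), first, second
```

For example, with $k = 10^4$ the crossover is at $\varepsilon = 10^{-1.5} \approx 0.032$: larger $\varepsilon$ puts the tester in the $k^{4/5}/\varepsilon^{6/5}$ regime, smaller $\varepsilon$ in the $k^{1/2}/\varepsilon^2$ regime.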

Intuitively, our lower bound proof consists of two steps. In the first step, we show it is no loss 
of generality to assume that an optimal algorithm only considers the ordering of the samples, and 
ignores all other information. In the second step, we construct a pair of distributions which is hard 
to distinguish given the condition that the tester is only allowed to look at the ordering of the 
samples and nothing more. 

Our first step is described in the following theorem. We note that unlike the arguments in the 
upper bound proofs, this part of our lower bound technique will work best for random variables of 
discrete support. 

Theorem 13. For all $n, k, m \in \mathbb{Z}_+$ there exists $N \in \mathbb{Z}_+$ such that the following holds: If there
exists an algorithm $A$ that for every pair of distributions $p$ and $q$, supported over $[N]$, distinguishes
the case $p = q$ from the case $\|p - q\|_{\mathcal{A}_k} \geq \varepsilon$ drawing $m$ samples, then there exists an algorithm $A'$
that for every pair of distributions $p'$ and $q'$ supported on $[n]$ distinguishes the case $p' = q'$ versus
$\|p' - q'\|_{\mathcal{A}_k} \geq \varepsilon$ using the same number of samples $m$. Moreover, $A'$ only considers the ordering of the
samples and ignores all other information.

Proof. As a preliminary simplification, we assume that our algorithm, instead of taking $m$ samples
from any combination of $p$ or $q$ of its choosing, takes exactly $m$ samples from $p$ and $m$ samples from
$q$, as such algorithms are strictly more powerful. This also allows us to assume that the algorithm
merely takes these random samples and applies some processing to determine its output.

As a critical tool of our proof, we will use the classical Ramsey theorem for hypergraphs. For 
completeness, we restate it here in a slightly adapted form. 

Lemma 14 (Ramsey theorem for hypergraphs, [CFS10]). Given a set $S$ and an integer $t$ let $\binom{S}{t}$
denote the set of subsets of $S$ of cardinality $t$. For all positive integers $a$, $b$ and $c$, there exists a
positive integer $N$ so that for any function $f : \binom{[N]}{a} \to [b]$, there exists an $S \subseteq [N]$ with $|S| = c$ so
that $f$ is constant on $\binom{S}{a}$.

In words, this means that if we color all subsets of size $a$ of a size $N$ set with at most $b$ different
colors, then for large enough $N$ we will find a subset $S$ of size $c$ such that all its subsets of size $a$ are colored
with the same color. Note that in our setting $c$ from the lemma equals $n$.

The idea of our proof is as follows. Given an algorithm $A$, we will use it to implement the
algorithm $A'$. Given $A$, we produce some monotonic function $f : [n] \to [N]$, and run $A$ on the
distributions $f(p)$ and $f(q)$. Since $f$ is order preserving, $\|f(p) - f(q)\|_{\mathcal{A}_k} = \|p - q\|_{\mathcal{A}_k}$, so our
algorithm is guaranteed to work. The tricky part will be to guarantee that the output of this new
algorithm $A'$ depends only on the ordering of the samples that it takes. Since we may assume that
$A$ is deterministic, once we pick which $2m$ samples are taken from $[N]$ the output will be some
function of the ordering of these samples (and in particular which are from $p$ and which are from
$q$). For the algorithm $A$, this function may depend upon the values that the samples happened to
have. Thus, for $A'$ to depend only on order, we need it to be the case that $A$ behaves the same way
on any subset of $\mathrm{Im}(f)$ of size $2m$. Fortunately, we can find such a set using Lemma 14.

Since our sample set has size at most $2m$, it is clear that the total number of possible sample
sets is at most $N^{2m}$. We color each of these subsets of $[N]$ of size $a = 2m$ one of a finite number of
colors. The color associates to the sample set the function that $A$ uses to obtain an output given
2m samples given by this set coming in a particular order (some of which are potentially equal). 
The total number of such functions is at most b = 2 2 . We let n be the proposed support size for 

$p'$ and $q'$. By Lemma 14, for $N$ sufficiently large, there are sets of size $n$ such that the function has
the same value on samples from these sets. Letting $f$ be the unique monotonic function from $[n]$ to
$[N]$ with this set as its image causes the output of $A'$ to depend only on the ordering of the samples.

The above reduction works as long as the samples given to our algorithm $A'$ are distinct. To
deal with the case where samples are potentially non-distinct, we show that it is possible to reduce
to the case where all $2m$ samples are distinct with 9/10 probability. To do this, we divide each
of our original bins into $200m^2$ sub-bins, and upon drawing a sample from a given bin, we assign
it instead to a uniformly random sub-bin. This procedure maintains the $\mathcal{A}_k$ distance between our
distributions, and guarantees that the probability of a collision is small. Now, our algorithm $A'$
will depend only on the order of the samples so long as there is no collision. As this happens
with probability 9/10, we can also ensure that this is the case when collisions do occur without
sacrificing correctness. This completes our proof. □
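The collision bound behind the sub-bin argument can be checked with a one-line union bound. The following worked estimate is a sketch under the stated setup ($2m$ samples, each bin split into $200m^2$ sub-bins chosen uniformly at random); it uses only that two fixed samples land in the same sub-bin with probability at most $1/(200m^2)$.

```latex
\Pr[\text{some two of the } 2m \text{ samples collide}]
  \;\le\; \binom{2m}{2} \cdot \frac{1}{200 m^2}
  \;=\; \frac{2m(2m-1)}{2} \cdot \frac{1}{200 m^2}
  \;<\; \frac{2 m^2}{200 m^2}
  \;=\; \frac{1}{100}
  \;\le\; \frac{1}{10}.
```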

We will now give the “hard” instance of the testing problem for algorithms that only consider
the ordering of the samples. We will first describe a construction that works for $\varepsilon = \Omega(k^{-1/6})$. We
define a mini-bucket to be a segment $I$, which can be divided into three subsegments $I_1, I_2, I_3$ in
that order so that $p(I_1) = p(I_3) = \varepsilon/(2k)$, $p(I_2) = 0$, and $q(I_1) = q(I_3) = 0$, $q(I_2) = \varepsilon/k$. We define
a bucket to be an interval consisting of a mini-bucket followed by an interval on which $p = q$ and on
which both $p, q$ have total mass $(1 - \varepsilon)/k$. Our distributions for $p$ and $q$ will consist of $k$ consecutive
buckets. See Figure 1 for an illustration.

Figure 1: f> = \p + \q when e = 1


Next consider partitioning the domain into macro-buckets each of which is a union of buckets
of total mass $\Theta(1/m)$. Note that these distributions have $\mathcal{A}_{2k+1}$ distance of $2\varepsilon$. An important fact
to note is the following:

Observation 15. If zero, one or two draws are made randomly and independently from (p + q)/2 
on a mini-bucket, then the distribution of which of p or q the samples came from and their relative 
ordering is indistinguishable from the case where p = q. 
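Observation 15 can be verified by exact enumeration. The sketch below assumes the mini-bucket layout described earlier (segments $I_1, I_2, I_3$ carrying $p$-mass, $q$-mass, $p$-mass in the ratio $1 : 2 : 1$ under $(p+q)/2$) and computes the distribution of the left-to-right source labels of a fixed number of independent draws; the names `SEGMENTS` and `ordering_distribution` are illustrative.

```python
from fractions import Fraction
from itertools import product

# Segments of a mini-bucket from left to right: (source label, probability
# under (p+q)/2 conditioned on landing in the mini-bucket).
SEGMENTS = [("p", Fraction(1, 4)), ("q", Fraction(1, 2)), ("p", Fraction(1, 4))]

def ordering_distribution(num_draws):
    """Exact distribution of the sorted-by-position label string of the draws."""
    dist = {}
    for choice in product(range(len(SEGMENTS)), repeat=num_draws):
        prob = Fraction(1)
        for seg in choice:
            prob *= SEGMENTS[seg][1]
        # Sorting segment indices sorts the draws by position; draws sharing a
        # segment share a label, so their relative order does not matter.
        labels = "".join(SEGMENTS[seg][0] for seg in sorted(choice))
        dist[labels] = dist.get(labels, Fraction(0)) + prob
    return dist
```

For one or two draws every label string has probability $2^{-\text{draws}}$, exactly as when $p = q$. With three draws the strings are no longer uniform (for instance the string `pqp` has probability $3/16$ rather than $1/8$), which is why the analysis below has to control triples landing in the same mini-bucket.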

To prove the lower bound for the algorithm $A'$, which is only allowed to look at the ordering
of samples, we let $X$ be a random variable that is taken to be 0 or 1 each with probability 1/2.
When $X = 0$ we define $p$ and $q$ as above with mini-buckets, macro-buckets and regular buckets as
described. When $X = 1$, we let $p = q$ and define mini-buckets to have total mass $\varepsilon/k$ for each of $p$
and $q$, buckets to have total mass $1/k$ each, and we combine buckets into macro-buckets as in the
$X = 0$ case.

Let $Y$ be the distribution on the (ordered) sequences, obtained by drawing $m' = \mathrm{Poi}(m)$ samples
from $p$ and $m'' = \mathrm{Poi}(m)$ samples from $q$, with $p$ and $q$ given by $X$. We are interested in bounding
the mutual information between $X$ and $Y$, since it must be $\Omega(1)$ if the algorithm is going to succeed
with probability bounded away from 1/2. We show the following:

Theorem 16. We have that $I(X : Y) = O(m^5 \varepsilon^6 / k^4)$.

Proof. We begin with a couple of definitions. Let $Y'$ denote $(Y, \alpha)$, where $\alpha$ is the information
about which draws come from which macro-bucket. $Y'$ consists of $Y'_i$, the sequence of samples
coming from the $i$-th macro-bucket. Note that

\[ I(X : Y) \leq I(X : Y') \leq \sum_{i=1}^{O(m)} I(X : Y'_i). \]

We will now estimate $I(X : Y'_i)$. We claim that it is $O(m^4 \varepsilon^6 / k^4)$ for each $i$. This would cause the
sum to be small enough and give our theorem. We have that

\[ I(X : Y'_i) = \sum_{y} \Pr(Y'_i = y \mid X = 1) \cdot O\left( \left( 1 - \frac{\Pr(Y'_i = y \mid X = 0)}{\Pr(Y'_i = y \mid X = 1)} \right)^2 \right). \]

We then have that

\[ I(X : Y'_i) = \sum_{\ell = 0}^{\infty} \sum_{y : |y| = \ell} \frac{O(1)^{\ell}}{\ell!} \cdot O\left( \left( 1 - \frac{\Pr(Y'_i = y \mid X = 0, |y| = \ell)}{\Pr(Y'_i = y \mid X = 1, |y| = \ell)} \right)^2 \right). \]

We note that if $X = 1$ and $|y| = \ell$, then any of the $2^{\ell}$ possible orderings are equally likely. On
the other hand, if $X = 0$, this also holds in an approximate sense. To show this, first consider
picking which mini-buckets our $\ell$ draws are from. If no three land in the same mini-bucket, then
Observation 15 implies that all orderings are equally likely. Therefore, the statistical distance
between $Y'_i \mid X = 0, |y| = \ell$ and $Y'_i \mid X = 1, |y| = \ell$ is at most the probability that some three draws
come from the same mini-bucket. This is in turn at most the expected number of triples that
land in the same mini-bucket, which is equal to $\binom{\ell}{3}$ times the probability that a particular triple
does. The probability that a particular triple lands in a particular mini-bucket is $O((m\varepsilon/k)^3)$. By definition, there are
$O(k/m)$ mini-buckets in a macro-bucket, so this probability is $O(\ell^3 \varepsilon^3 (m/k)^2)$. Therefore, we have
that


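\begin{align*}
I(X : Y'_i) &= \sum_{\ell} \frac{O(1)^{\ell}}{\ell!} \sum_{y : |y| = \ell} O(4^{\ell}) \big( \Pr(Y'_i = y \mid X = 0, |y| = \ell) - \Pr(Y'_i = y \mid X = 1, |y| = \ell) \big)^2 \\
&\leq \sum_{\ell} \frac{O(1)^{\ell}}{\ell!} \Big( \sum_{y : |y| = \ell} \big| \Pr(Y'_i = y \mid X = 0, |y| = \ell) - \Pr(Y'_i = y \mid X = 1, |y| = \ell) \big| \Big)^2 \\
&= \sum_{\ell} \frac{O(1)^{\ell}}{\ell!} \cdot O(\ell^6 \varepsilon^6 m^4 / k^4) \\
&= \frac{\varepsilon^6 m^4}{k^4} \sum_{\ell} \frac{O(1)^{\ell}\, \ell^6}{\ell!} \\
&= O\left( \frac{m^4 \varepsilon^6}{k^4} \right).
\end{align*}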

This completes our proof. □ 
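The way Theorem 16 translates into a sample lower bound can be made explicit. The following short derivation is a sketch that combines the theorem with the requirement, stated before it, that $I(X : Y) = \Omega(1)$ for any successful tester.

```latex
\Omega(1) \;=\; I(X : Y) \;=\; O\!\left( \frac{m^5 \varepsilon^6}{k^4} \right)
\quad\Longrightarrow\quad
m^5 \;=\; \Omega\!\left( \frac{k^4}{\varepsilon^6} \right)
\quad\Longrightarrow\quad
m \;=\; \Omega\!\left( \frac{k^{4/5}}{\varepsilon^{6/5}} \right).
```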

The above construction only works when $k \geq m$, or equivalently, when $\varepsilon = \Omega(k^{-1/6})$. When $\varepsilon$ is
small, we need a slightly different construction. We will similarly split our domain into mini-buckets
and macro-buckets and argue based on shared information. Once again we define two distributions
$p$ and $q$, though this time the distributions themselves will need to be randomized. Given $k$ and
$\varepsilon$, we begin by splitting the domain into $k$ macro-buckets. Each macro-bucket will have mass $1/k$
under both $p$ and $q$.

First pick a global variable $X$ to be either 0 or 1 with equal probability. If $X = 1$ then we
will have $p = q$ and if $X = 0$, $\|p - q\|_{\mathcal{A}_k} = \varepsilon$. For each macro-bucket, pick an $x$ uniformly in
$[0, (1 - \varepsilon)/k]$. The macro-bucket will consist of an interval on which $p = q$ with mass $x$ (for each
of $p, q$), followed by a mini-bucket, followed by an interval of mass $(1 - \varepsilon)/k - x$ on which $p = q$.
The mini-bucket is an interval of mass $\varepsilon/k$ under either $p$ or $q$. If $X = 1$, we have $p = q$ on the
mini-bucket. If $X = 0$, the mini-bucket consists of an interval of mass $\varepsilon/(2k)$ under $q$ and 0 under
$p$, an interval of mass $\varepsilon/k$ under $p$ and 0 under $q$, and then another interval of mass $\varepsilon/(2k)$ under
$q$ and 0 under $p$.

We let $Y$ be the random variable associated with the ordering of elements from a set of $\mathrm{Poi}(m)$
draws from each of $p$ and $q$. We show:

Theorem 17. If $m\varepsilon = O(k)$, $\log(mk/\varepsilon) = O(\varepsilon^{-1})$, and $k = O(m)$, with implied constants sufficiently
small, then $I(X : Y) = O(m^5 \varepsilon^6 / k^4)$.

Note that the above statement differs from Theorem 16 in that $X$ and $Y$ are defined differently.


Proof. Once again, we let $Y'$ be $Y$ along with the information of which draws came from which
macro-bucket, and let $Y'_i$ be the information of the draws from the $i$-th macro-bucket along with
their ordering. It suffices for us to show that $I(X : Y'_i) = O(m^5 \varepsilon^6 k^{-5})$ for each $i$ (as now there are
only $k$ macro-buckets rather than $m$).

Let $s$ be a string of $\ell$ ordered draws from $p$ and $q$. In particular, we may consider $s$ to be a
string $s_1 s_2 \ldots s_{\ell}$, where $s_i \in \{p, q\}$. We wish to consider the probability that $Y'_i = s$ under the
conditions that $X = 0$ or that $X = 1$. In order to do this, we further condition on which elements
of $s$ were drawn from the mini-bucket. For $1 \leq a \leq b \leq \ell$ we consider the probability that not
only did we obtain sequence $s$, but that the draws $s_a, \ldots, s_b$ were exactly the ones coming from the
mini-bucket within this macro-bucket. Let $h$ denote the ordered string coming from elements drawn
from the mini-bucket and $M$ the ordered sequence of strings coming from elements not drawn from
the mini-bucket. The probability of the event in question is then

\[ \Pr(h = s_a \ldots s_b) \cdot \Pr(M = s_1 \ldots s_{a-1} s_{b+1} \ldots s_{\ell}) \cdot \Pr(\text{the mini-bucket is placed between } s_{a-1} \text{ and } s_{b+1}). \]

Note that the mini-bucket can be thought of as being randomly and uniformly inserted within an
interval of length $(1 - \varepsilon)/k$ and that this is equally likely to be inserted between any pair of elements
of $M$. Thus, the probability of the third term in the product is exactly $1/(\ell + a - b)$. The second
probability is the probability that $\ell + a - b - 1$ elements are drawn from the complement of the
mini-bucket times $2^{-(\ell + a - b - 1)}$, as draws from $p$ and $q$ are equally likely. Thus, letting $t = b - a + 1$
(i.e., the number of elements in the mini-bucket), we have that

\[ \Pr(Y'_i = s) = e^{-m/k} \sum_{t=0}^{\ell} \frac{\left( \frac{m(1-\varepsilon)}{2k} \right)^{\ell - t}}{(\ell - t)!} \cdot \frac{(m\varepsilon/k)^t}{t!} \cdot \frac{1}{\ell - t + 1} \sum_{a=1}^{\ell - t + 1} \Pr(h = s_a \ldots s_{a+t-1} : |h| = t). \]


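Note that this equality holds even after conditioning upon $X$. We next simplify this expression further
by grouping together terms in the last sum based upon the value of the substring $s_a \ldots s_{a+t-1}$,
which we call $r$. We get that

\[ \Pr(Y'_i = s) = e^{-m/k} \sum_{t=0}^{\ell} \frac{\left( \frac{m(1-\varepsilon)}{2k} \right)^{\ell - t}}{(\ell - t)!} \cdot \frac{(m\varepsilon/k)^t}{t!} \cdot \frac{1}{\ell - t + 1} \sum_{|r| = t} N_{r,s} \Pr(h = r : |h| = t), \]

where $N_{r,s}$ is the number of occurrences of $r$ as a substring of $s$.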
Next, we wish to bound

\[ \sum_{|s| = \ell} \big| \Pr(Y'_i = s : X = 0) - \Pr(Y'_i = s : X = 1) \big|^2. \]

By the above formula this is at most

\[ e^{-2m/k} \sum_{|s| = \ell} \left( \sum_{t=0}^{\ell} \frac{\left( \frac{m(1-\varepsilon)}{2k} \right)^{\ell - t}}{(\ell - t)!} \cdot \frac{(m\varepsilon/k)^t}{t!} \cdot \frac{1}{\ell - t + 1} \sum_{|r| = t} N_{r,s} \big( \Pr(h = r : |h| = t, X = 0) - \Pr(h = r : |h| = t, X = 1) \big) \right)^2. \tag{2} \]

For fixed values of $t$ we consider the sum

\[ \sum_{|s| = \ell} \Big| \sum_{|r| = t} N_{r,s} \big( \Pr(h = r : |h| = t, X = 0) - \Pr(h = r : |h| = t, X = 1) \big) \Big|^2. \]

Note that if $t \leq 2$ then $\Pr(h = r : |h| = t, X = 0) = \Pr(h = r : |h| = t, X = 1)$, and so the above
sum is 0. Otherwise, it is at most

\[ \sum_{|s| = \ell} \sum_{|r| = t} \big| N_{r,s} - (\ell + 1 - t)/2^t \big|^2 \]

because $\sum_r \Pr(h = r : |h| = t, X = 0) = \sum_r \Pr(h = r : |h| = t, X = 1) = 1$. Note on the other hand
that the expectation over random strings $s$ of length $\ell$ of $N_{r,s} - (\ell + 1 - t)/2^t$ is 0. Furthermore,
the variance of $N_{r,s}$ is easily bounded by $t\ell 2^{-t}$ as whether or not two disjoint substrings of $s$ are
equal to $r$ are independent events. Therefore, the above sum is at most

\[ 2^{\ell}\, 2^{t}\, t \ell\, 2^{-t} = 2^{\ell}\, t \ell. \]
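The two facts about $N_{r,s}$ used above (its mean over uniformly random strings $s$ is $(\ell + 1 - t)/2^t$, and its variance is at most $t\ell 2^{-t}$) can be checked by brute force for small parameters. The sketch below enumerates all strings over the alphabet $\{p, q\}$; the function names are illustrative and the check is only a sanity test for small $\ell$, not a proof.

```python
from fractions import Fraction
from itertools import product

def count_occurrences(r, s):
    """N_{r,s}: number of (possibly overlapping) occurrences of r as a substring of s."""
    t = len(r)
    return sum(s[i:i + t] == r for i in range(len(s) - t + 1))

def occurrence_moments(r, length):
    """Exact mean and variance of N_{r,s} over uniformly random s in {p,q}^length."""
    counts = [count_occurrences(r, "".join(s)) for s in product("pq", repeat=length)]
    total = len(counts)
    mean = Fraction(sum(counts), total)
    variance = Fraction(sum(c * c for c in counts), total) - mean * mean
    return mean, variance
```

For instance, with $\ell = 6$ and $t = 3$ every pattern $r$ has mean $4/8$, and the variance stays below $t\ell 2^{-t} = 18/8$.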


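Hence, by Cauchy-Schwarz, we have that

\[ \sum_{|s| = \ell} \Big( \sum_{|r| = t} \big| N_{r,s} - (\ell + 1 - t)/2^t \big| \Big)^2 \leq 2^{\ell}\, 2^{t}\, t \ell. \]

Therefore, the expression in (2) is at most

\[ e^{-2m/k} \left( \sum_{t=3}^{\ell} \frac{\left( \frac{m(1-\varepsilon)}{2k} \right)^{\ell - t}}{(\ell - t)!} \cdot \frac{(m\varepsilon/k)^t}{t!} \cdot \frac{1}{\ell - t + 1} \cdot \big( 2^{\ell}\, 2^{t}\, \ell t \big)^{1/2} \right)^2. \]

Assuming that $\ell\varepsilon$ is sufficiently small, these terms are decreasing exponentially with $t$, and thus
this is

\[ O\left( e^{-2m/k}\, \frac{\big( m^2/(2k^2) \big)^{\ell}}{(\ell!)^2}\, \varepsilon^6 \ell^5 \right). \]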


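Now we have that for $N$ a sufficiently small constant times $\varepsilon^{-1}$,

\begin{align*}
I(X : Y'_i) &= \sum_{\ell} \sum_{s : |s| = \ell} \Pr(Y'_i = s : X = 1) \cdot O\left( \left( 1 - \frac{\Pr(Y'_i = s : X = 0)}{\Pr(Y'_i = s : X = 1)} \right)^2 \right) \\
&= \sum_{\ell} \sum_{s : |s| = \ell} \left( e^{-m/k} \frac{(m/(2k))^{\ell}}{\ell!} \right)^{-1} O\big( \Pr(Y'_i = s : X = 1) - \Pr(Y'_i = s : X = 0) \big)^2 \\
&\leq \sum_{\ell > N} O\left( \frac{(2m/k)^{\ell}}{\ell!} \right) + \sum_{\ell \leq N} \left( e^{-m/k} \frac{(m/(2k))^{\ell}}{\ell!} \right)^{-1} O\Big( \sum_{s : |s| = \ell} \big| \Pr(Y'_i = s : X = 1) - \Pr(Y'_i = s : X = 0) \big|^2 \Big) \\
&\leq \sum_{\ell > N} O\left( \frac{(2m/k)^{\ell}}{\ell!} \right) + \sum_{\ell \leq N} \left( e^{-m/k} \frac{(m/(2k))^{\ell}}{\ell!} \right)^{-1} O\left( e^{-2m/k}\, \frac{\big( m^2/(2k^2) \big)^{\ell}}{(\ell!)^2}\, \varepsilon^6 \ell^5 \right) \\
&\leq \sum_{\ell > N} O\left( \frac{m}{kN} \right)^{\ell} + \sum_{\ell} O\left( e^{-m/k}\, \frac{(m/k)^{\ell}}{\ell!}\, \varepsilon^6 \ell^5 \right).
\end{align*}

Since $m/(kN)$ is sufficiently small, the first term is at most $(1/2)^N$, which is polynomially small
in $mk/\varepsilon$, and thus negligible. The second term is the expectation of $\varepsilon^6 \ell^5$ for $\ell$ a Poisson random
variable with mean $m/k$. Thus, it is easily seen to be $O((m/k)^5 \varepsilon^6)$. Therefore, we have that
$I(X : Y'_i) = O(m^5 \varepsilon^6 k^{-5})$, and therefore, $I(X : Y) = O(m^5 \varepsilon^6 k^{-4})$, as desired. □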

We are now ready to complete the proof of our general lower bound. 

Theorem 18. For any $k > 2$, there exists an $N$ so that any algorithm that is given sample access
to two distributions, $p$ and $q$ over $[N]$, and can distinguish between the cases $p = q$ and $\|p - q\|_{\mathcal{A}_k} \geq \varepsilon$
with probability at least 2/3, requires at least $\Omega(\max\{k^{4/5}/\varepsilon^{6/5}, k^{1/2}/\varepsilon^2\})$ samples.

Proof. The lower bound of $k^{1/2}/\varepsilon^2$ follows from the known lower bound [Pan08] even in the case
where $q$ is known and $p$ and $q$ have support of size $k$. It now suffices to consider the case that
$\varepsilon > k^{-1/2}$ and $m$ is a sufficiently small constant times $k^{4/5} \varepsilon^{-6/5}$.

Note that by Theorem 13 we may assume that the algorithm in question takes $m$ samples from
each of $p$ and $q$ and determines its output based only on the ordering of the samples. We need to
show that this is impossible for $N$ sufficiently large.

We note that if we allow $p$ and $q$ to be continuous distributions instead of discrete ones we
are already done. If $m < k$, we use our first counter-example construction, and if $m \geq k$ we use the
second one. If we let $X$ be randomly 0 or 1, and set $p = q$ for $X = 1$ and $p, q$ as described above
when $X = 0$, then by Theorems 16 and 17 the shared information between $X$ and the output of
our algorithm is at most $O(m^5 \varepsilon^6 k^{-4}) = o(1)$, and thus our algorithm cannot correctly determine
$X$ with constant probability.

In order to prove our Theorem, we will need to make this work for distributions $p$ and $q$
with finite support size as follows: By splitting our domain into $m^3$ intervals each of equal mass
under $p + q$, we note that the $\mathcal{A}_k$ distance between the distributions is only negligibly affected.
Furthermore, with high probability, $m$ samples will have no pair chosen from the same bin. Thus,
the distribution on orderings of samples from these discrete distributions is nearly identical to
the continuous case, and thus our algorithm would behave nearly identically. This completes the
proof. □

Appendix

A Proof of Proposition 11

In this section, we prove Proposition 11. We note that it suffices to attain confidence probability
2/3 with $O(\sqrt{k}/\varepsilon^2)$ samples, as we can then run $O(\log(1/\delta))$ independent iterations to boost the
confidence to $1 - \delta$. Our starting point is the following Theorem from [CDVV14]:

Theorem 19 ([CDVV14], Proposition 3.1). For any distributions $p$ and $q$ over $[n]$ such that $\|p\|_2 \leq O(1/\sqrt{n})$
and $\|q\|_2 \leq O(1/\sqrt{n})$ there is a testing algorithm that distinguishes with probability at least 2/3 the
case that $q = p$ from the case that $\|q - p\|_2 \geq \varepsilon/\sqrt{n}$ when given $O(\sqrt{n}/\varepsilon^2)$ samples from $q$ and $p$.

Our $\mathcal{A}_k$ testing algorithm for this regime is the following:

Algorithm Test-Identity-Flat-$\mathcal{A}_k(p, q, n, \varepsilon)$

Input: sample access to distributions $p$ and $q$ over $[n]$ with $\|p\|_2, \|q\|_2 = O(1/\sqrt{n})$, $k \in \mathbb{Z}_+$
with $2 \leq k \leq n$, and $\varepsilon > 0$.

Output: “YES” if $q = p$; “NO” if $\|q - p\|_{\mathcal{A}_k} \geq \varepsilon$.

1. Draw samples $S_1$, $S_2$ of size $m = O(\sqrt{k}/\varepsilon^2)$ from $q$ and $p$.

2. By artificially increasing the support if necessary, we can guarantee that $n = k \cdot 2^{j_0}$,
where $j_0 := \lceil \log_2(1/\varepsilon) \rceil + O(1)$.

3. Consider the collection $\{\mathcal{I}^{(j)}\}_{j=0}^{j_0 - 1}$ of $j_0$ partitions of $[n]$ into intervals; the partition
$\mathcal{I}^{(j)} = (I_i^{(j)})_{i=1}^{\ell_j}$ consists of $\ell_j = k \cdot 2^j$ many intervals with $I_i^{(j)}$ of length $n/\ell_j + O(1)$, and
$I_i^{(j)}$ the union of two adjacent intervals of $\mathcal{I}^{(j+1)}$.

4. For $j = 0, 1, \ldots, j_0 - 1$:

(a) Consider the reduced distributions $q_r^{\mathcal{I}^{(j)}}$ and $p_r^{\mathcal{I}^{(j)}}$. Use the samples $S_1$, $S_2$ to simulate
samples to $q_r^{\mathcal{I}^{(j)}}$ and $p_r^{\mathcal{I}^{(j)}}$.

(b) Run Test-Identity-$L_2(q_r^{\mathcal{I}^{(j)}}, p_r^{\mathcal{I}^{(j)}}, \ell_j, \varepsilon_j, \delta_j)$ for $\varepsilon_j = C \cdot \varepsilon \cdot 2^{3j/8}$ for $C > 0$ a sufficiently
small constant and $\delta_j = 2^{-j}/6$, i.e., test whether $q_r^{\mathcal{I}^{(j)}} = p_r^{\mathcal{I}^{(j)}}$ versus
$\|q_r^{\mathcal{I}^{(j)}} - p_r^{\mathcal{I}^{(j)}}\|_2 > \gamma_j := \varepsilon_j/\sqrt{\ell_j}$.

5. If all the testers in Step 4(b) output “YES”, then output “YES”; otherwise output “NO”.

Note in the above that when $\varepsilon_j > 1$, the appropriate tester requires no samples. The
following proposition characterizes the performance of the above algorithm.

Proposition 20. The algorithm Test-Identity-Flat-$\mathcal{A}_k(p, q, n, \varepsilon)$, on input a sample of size $m = O(\sqrt{k}/\varepsilon^2)$
drawn from distributions $q$ and $p$ over $[n]$ with $\|p\|_2, \|q\|_2 = O(1/\sqrt{n})$, $\varepsilon > 0$, and an
integer $k$ with $2 \leq k \leq n$, correctly distinguishes the case that $q = p$ from the case that $\|q - p\|_{\mathcal{A}_k} \geq \varepsilon$,
with probability at least 2/3.

Proof. First, it is straightforward to verify the claimed sample complexity, as the algorithm only
draws samples in Step 1. Note that the algorithm uses the same set of samples $S_1$, $S_2$ for all testers
in Step 4(b). Note that it is easy to see that $\|p_r^{\mathcal{I}^{(j)}}\|_2, \|q_r^{\mathcal{I}^{(j)}}\|_2 = O(1/\sqrt{\ell_j})$, and therefore, by
Theorem 19 the tester Test-Identity-$L_2(q_r^{\mathcal{I}^{(j)}}, p_r^{\mathcal{I}^{(j)}}, \ell_j, \varepsilon_j, \delta_j)$, on input a set of $m_j = O((\sqrt{\ell_j}/\varepsilon_j^2) \cdot \log(1/\delta_j))$
samples from $q_r^{\mathcal{I}^{(j)}}$ and $p_r^{\mathcal{I}^{(j)}}$ distinguishes the case that $q_r^{\mathcal{I}^{(j)}} = p_r^{\mathcal{I}^{(j)}}$ from the case that
$\|q_r^{\mathcal{I}^{(j)}} - p_r^{\mathcal{I}^{(j)}}\|_2 \geq \gamma_j = \varepsilon_j/\sqrt{\ell_j}$ with probability at least $1 - \delta_j$. From our choice of parameters
it can be verified that $\max_j m_j \leq m = O(\sqrt{k}/\varepsilon^2)$, hence we can use the same sample $S_1$, $S_2$ as
input to these testers for all $0 \leq j \leq j_0 - 1$. In fact, it is easy to see that $\sum_{j=0}^{j_0 - 1} m_j = O(m)$,
which implies that the overall algorithm runs in sample-linear time. Since each tester in Step 4(b)
has error probability $\delta_j$, by a union bound over all $j \in \{0, \ldots, j_0 - 1\}$, the total error probability
is at most $\sum_{j=0}^{j_0 - 1} \delta_j = (1/6) \cdot \sum_{j=0}^{j_0 - 1} 2^{-j} < (1/6) \cdot \sum_{j=0}^{\infty} 2^{-j} = 1/3$. Therefore, with probability at least 2/3 all the
testers in Step 4(b) succeed. We will henceforth condition on this “good” event, and establish the
completeness and soundness properties of the overall algorithm under this conditioning.

We start by establishing completeness. If $q = p$, then for any partition $\mathcal{I}^{(j)}$, $0 \le j \le j_0 - 1$, we have that $q_r^{\mathcal{I}^{(j)}} = p_r^{\mathcal{I}^{(j)}}$. By our aforementioned conditioning, all testers in Step 4(b) will output “YES”, hence the overall algorithm will also output “YES”, as desired.

We now proceed to establish the soundness of our algorithm. Assuming that $\|q - p\|_{\mathcal{A}_k} \ge \epsilon$, we want to show that the algorithm Test-Identity-$\mathcal{A}_k(q, n, \epsilon)$ outputs “NO” with probability at least 2/3. Towards this end, we prove the following structural lemma:

Lemma 21. For $C > 0$ a sufficiently small constant, if $\|q - p\|_{\mathcal{A}_k} \ge \epsilon$, there exists $j \in \mathbb{Z}_+$ with $0 \le j \le j_0 - 1$ such that $\|q_r^{\mathcal{I}^{(j)}} - p_r^{\mathcal{I}^{(j)}}\|_2^2 \ge \gamma_j^2$.

Given the lemma, the soundness property of our algorithm follows easily. Indeed, since all testers Test-Identity-$L_2$($q_r^{\mathcal{I}^{(j)}}, p_r^{\mathcal{I}^{(j)}}, \ell_j, \epsilon_j, \delta_j$) of Step 4(b) are successful by our conditioning, Lemma 21 implies that at least one of them outputs “NO”, hence the overall algorithm will output “NO”. □
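The two parameter claims used in the proof, namely that the failure probabilities $\delta_j = 2^{-j}/6$ sum to at most $1/3$ and that the per-scale sample sizes $m_j$ sum to $O(m)$, can be checked numerically. The sketch below is only illustrative: the function names are hypothetical, and the constants hidden in the $O(\cdot)$ notation are dropped, so `relative_sample_sizes` reports $m_j$ in units of $\sqrt{k}/(C\epsilon)^2$, obtained by substituting $\ell_j = k \cdot 2^j$ and $\epsilon_j = C \epsilon 2^{3j/8}$ into $m_j = (\sqrt{\ell_j}/\epsilon_j^2) \log(1/\delta_j)$.

```python
import math

def failure_budget(j0):
    """Union-bound total of the per-scale failure probabilities delta_j = 2**-j / 6."""
    return sum(2.0 ** (-j) / 6 for j in range(j0))

def relative_sample_sizes(j0):
    """m_j in units of sqrt(k) / (C * eps)**2, with O(.) constants dropped.

    sqrt(ell_j) / eps_j**2 contributes 2**(j/2) / 2**(3j/4) = 2**(-j/4),
    and log(1 / delta_j) = log(6 * 2**j).
    """
    return [2.0 ** (-j / 4) * math.log(6 * 2.0 ** j) for j in range(j0)]

sizes = relative_sample_sizes(60)
print(failure_budget(60), max(sizes), sum(sizes))
```

Because the geometric factor $2^{-j/4}$ decays faster than the logarithmic factor grows, both the maximum and the sum of the relative sizes stay bounded independently of $j_0$.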

The proof of Lemma 21 is very similar to the analogous lemma in [DKN15]. For the sake of
completeness, it is given in the following subsection.


A.1 Proof of Lemma 21

We claim that it is sufficient to take $C \le 5 \cdot 10^{-6}$. Thus, we are in the case where $n = 2^{j_0 - 1} \cdot k$ and have argued that it suffices to show that our algorithm works to distinguish $\mathcal{A}_k$-distance in this setting with $\epsilon_j = 10^{-5} \cdot \epsilon \cdot 2^{3j/8}$.

We make use of the following definition: 

Definition 22. For $p$ and $q$ arbitrary distributions over $[n]$, we define the scale-sensitive-$L_2$ distance between $q$ and $p$ to be

$$\|q - p\|_{[k]}^2 \stackrel{\mathrm{def}}{=} \max_{\mathcal{I} = (I_1, \ldots, I_r) \in \mathbf{W}_{1/k}} \sum_{i=1}^{r} \frac{\mathrm{Discr}^2(I_i)}{\mathrm{width}^{1/8}(I_i)},$$

where $\mathbf{W}_{1/k}$ is the collection of all interval partitions of $[n]$ into intervals of width at most $1/k$, $\mathrm{Discr}(I) = |p(I) - q(I)|$, and $\mathrm{width}(I)$ is the number of bins in $I$ divided by $n$.
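For intuition, the maximum in Definition 22 can be evaluated exactly on small instances, because the objective is a sum over the intervals of the partition and so admits a dynamic program over prefixes of $[n]$. The following Python sketch is an assumption-laden illustration and is not part of the algorithm above: the function name is hypothetical, `p` and `q` are arrays of bin probabilities, and it requires $k \le n$ so that intervals of width at most $1/k$ exist.

```python
import numpy as np

def scale_sensitive_sq(p, q, k):
    """Squared scale-sensitive L2 distance of Definition 22, by brute-force DP."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    n = len(p)
    max_len = n // k                       # most bins an interval of width <= 1/k can hold
    cum = np.concatenate([[0.0], np.cumsum(p - q)])
    best = np.full(n + 1, -np.inf)         # best[e]: optimal value over partitions of the first e bins
    best[0] = 0.0
    for end in range(1, n + 1):
        for length in range(1, min(max_len, end) + 1):
            discr = abs(cum[end] - cum[end - length])      # |p(I) - q(I)| for the last interval
            term = discr ** 2 / (length / n) ** 0.125      # Discr^2(I) / width^{1/8}(I)
            best[end] = max(best[end], best[end - length] + term)
    return float(best[n])

# Mass on the first half versus mass on the second half, with k = 2.
print(scale_sensitive_sq([0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5], 2))
```

The running time is $O(n^2/k)$, so this is suitable only for checking small examples.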

The first thing we need to show is that if $q$ and $p$ have large $\mathcal{A}_k$ distance then they also have
large scale-sensitive-$L_2$ distance. Indeed, we have the following lemma:


Lemma 23. For $p$ and $q$ arbitrary distributions over $[n]$, we have that

$$\|q - p\|_{[k]}^2 \ge \frac{\|q - p\|_{\mathcal{A}_k}^2}{(2k)^{7/8}}.$$


Proof. Let $\epsilon = \|q - p\|_{\mathcal{A}_k}$. Consider the optimal $\mathcal{I}^*$ in the definition of the $\mathcal{A}_k$ distance. By further subdividing intervals of width more than $1/k$ into smaller ones, we can obtain a new partition, $\mathcal{I}' = (I'_i)_{i=1}^{s}$, of cardinality $s \le 2k$ all of whose parts have width at most $1/k$. Furthermore, we have that $\sum_i \mathrm{Discr}(I'_i) \ge \epsilon$. Using this partition to bound from below $\|q - p\|_{[k]}^2$, by Cauchy-Schwarz we obtain that

$$\|q - p\|_{[k]}^2 \ge \sum_i \frac{\mathrm{Discr}^2(I'_i)}{\mathrm{width}(I'_i)^{1/8}} \ge \frac{\left(\sum_i \mathrm{Discr}(I'_i)\right)^2}{\sum_i \mathrm{width}(I'_i)^{1/8}} \ge \frac{\epsilon^2}{2k \, (1/(2k))^{1/8}} = \frac{\epsilon^2}{(2k)^{7/8}}.$$

□
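The chain of inequalities in this proof can be sanity-checked numerically for any fixed partition. The Python sketch below is an illustration under stated assumptions: `discr` and `widths` are hypothetical inputs holding the discrepancies and widths of the $s$ parts, with positive widths summing to 1, and the function name is not from the paper. It returns the three quantities $\sum_i \mathrm{Discr}^2(I'_i)/\mathrm{width}(I'_i)^{1/8}$, $(\sum_i \mathrm{Discr}(I'_i))^2 / \sum_i \mathrm{width}(I'_i)^{1/8}$, and $(\sum_i \mathrm{Discr}(I'_i))^2 / s^{7/8}$, which should be non-increasing in that order.

```python
import numpy as np

def lemma23_chain(discr, widths):
    """Return the three successive quantities in the Cauchy-Schwarz chain."""
    discr = np.asarray(discr, dtype=float)
    widths = np.asarray(widths, dtype=float)
    s = len(widths)
    weighted_squares = np.sum(discr ** 2 / widths ** 0.125)
    after_cauchy_schwarz = np.sum(discr) ** 2 / np.sum(widths ** 0.125)
    # sum_i width_i^{1/8} is maximized by equal widths 1/s, giving s * (1/s)^{1/8} = s^{7/8}.
    after_width_bound = np.sum(discr) ** 2 / s ** 0.875
    return float(weighted_squares), float(after_cauchy_schwarz), float(after_width_bound)

rng = np.random.default_rng(0)
a, b, c = lemma23_chain(rng.random(6), rng.dirichlet(np.ones(6)))
print(a >= b >= c)
```

With $s \le 2k$, the last quantity is at least $\epsilon^2/(2k)^{7/8}$ whenever the discrepancies sum to at least $\epsilon$, which is the bound stated in the lemma.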
The second important fact about the scale-sensitive-$L_2$ distance is that if it is large then one of
the partitions considered in our algorithm will produce a large $L_2$ error.

Proposition 24. Let $p$ and $q$ be distributions over $[n]$. Then we have that

$$\|q - p\|_{[k]}^2 \le 10^8 \sum_{j=0}^{j_0 - 1} \sum_{i=1}^{2^j \cdot k} \frac{\mathrm{Discr}^2(I_i^{(j)})}{\mathrm{width}^{1/8}(I_i^{(j)})}. \qquad (3)$$


Proof. Let $\mathcal{J} \in \mathbf{W}_{1/k}$ be the optimal partition used when computing the scale-sensitive-$L_2$ distance $\|q - p\|_{[k]}$. In particular, it is a partition into intervals of width at most $1/k$ so that $\sum_i \mathrm{Discr}^2(J_i)/\mathrm{width}(J_i)^{1/8}$ is as large as possible. To prove (3), we prove a notably stronger claim. In particular, we will prove that for each interval $J_\ell \in \mathcal{J}$

$$\frac{\mathrm{Discr}^2(J_\ell)}{\mathrm{width}^{1/8}(J_\ell)} \le 10^8 \sum_{j=0}^{j_0 - 1} \sum_{i : I_i^{(j)} \subset J_\ell} \frac{\mathrm{Discr}^2(I_i^{(j)})}{\mathrm{width}^{1/8}(I_i^{(j)})}.$$

Summing over $\ell$ would then yield $\|q - p\|_{[k]}^2$ on the left hand side and a strict subset of the terms from (3) on the right hand side. From here on, we will consider only a single interval $J_\ell$. For notational convenience, we will drop the subscript and merely call it $J$.

First, note that if $|J| \le 10^8$, then this follows easily from considering just the sum over $j = j_0 - 1$. Indeed, if $t = |J|$, then $J$ is divided into $t$ intervals of size 1. By the triangle inequality, the sum of the discrepancies of these intervals is at least the discrepancy of $J$, and thus, by Cauchy-Schwarz, the sum of the squares of the discrepancies is at least $\mathrm{Discr}^2(J)/t$. Furthermore, the widths of these subintervals are all smaller than the width of $J$ by a factor of $t$. Thus, in this case the sum on the right hand side of (3) is at least a $1/t^{7/8} \ge 10^{-8}$ fraction of the left hand side.

Otherwise, if $|J| > 10^8$, we can find a $j$ so that $\mathrm{width}(J)/10^8 < 1/(2^j \cdot k) \le 2 \cdot \mathrm{width}(J)/10^8$. We claim that in this case the claimed inequality for $J$ holds even if we restrict the sum on the right hand side to this value of $j$. Note that $J$ contains at most $10^8$ intervals of $\mathcal{I}^{(j)}$, and that it is covered by these intervals plus two narrower intervals on the ends. Call these end-intervals $R_1$ and $R_2$. We claim that $\mathrm{Discr}(R_i) \le \mathrm{Discr}(J)/3$. This is because otherwise it would be the case that

$$\frac{\mathrm{Discr}^2(R_i)}{\mathrm{width}^{1/8}(R_i)} > \frac{\mathrm{Discr}^2(J)}{\mathrm{width}^{1/8}(J)}.$$

(This is because $(1/3)^2 \cdot (2/10^8)^{-1/8} > 1$.) This is a contradiction, since it would mean that partitioning $J$ into $R_i$ and its complement would improve the sum defining $\|q - p\|_{[k]}$, which was assumed to be maximum. This means that the sum of the discrepancies of the $I_i^{(j)}$ contained in $J$ must be at least $\mathrm{Discr}(J)/3$, so the sum of their squares is at least $\mathrm{Discr}^2(J)/(9 \cdot 10^8)$. On the other hand, each of these intervals is narrower than $J$ by a factor of at least $10^8/2$, thus the appropriate sum of $\mathrm{Discr}^2(I_i^{(j)})/\mathrm{width}^{1/8}(I_i^{(j)})$ is at least

$$\frac{\mathrm{Discr}^2(J)}{10^8 \, \mathrm{width}^{1/8}(J)}.$$

This completes the proof. □
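The numerical constants used in the two cases of this proof can be verified with a few lines of Python. This is only an arithmetic check, and the variable names are hypothetical.

```python
# Large-interval case: an end-interval R with Discr(R) > Discr(J)/3 and
# width(R) <= 2 * width(J) / 10**8 would beat J if this factor exceeds 1.
end_interval_gain = (1 / 3) ** 2 * (2 / 10 ** 8) ** (-1 / 8)

# The same factor bounds the restricted sum from below:
# Discr^2(J) / (9 * 10**8) times (10**8 / 2)**(1/8) / width^{1/8}(J).
restricted_sum_factor = (1 / (9 * 10 ** 8)) * (10 ** 8 / 2) ** (1 / 8)

# Small-interval case: for t <= 10**8 unit-size pieces, the loss is 1 / t**(7/8).
small_case_worst = (10 ** 8) ** (-7 / 8)

print(end_interval_gain > 1, restricted_sum_factor >= 1e-8, small_case_worst >= 1e-8)
```

The margin in the first check is small (the factor is only slightly larger than 1), which is why the constant $10^8$ and the threshold $\mathrm{Discr}(J)/3$ are tied together.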


We are now ready to prove Lemma 21.


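Proof. If $\|q - p\|_{\mathcal{A}_k} \ge \epsilon$ we have by Lemma 23 that

$$\|q - p\|_{[k]}^2 \ge \frac{\epsilon^2}{(2k)^{7/8}}.$$

By Proposition 24 this implies that

$$\frac{\epsilon^2}{(2k)^{7/8}} \le 10^8 \sum_{j=0}^{j_0 - 1} \sum_{i=1}^{2^j \cdot k} \frac{\mathrm{Discr}^2(I_i^{(j)})}{\mathrm{width}^{1/8}(I_i^{(j)})} = 10^8 \sum_{j=0}^{j_0 - 1} (2^j k)^{1/8} \, \|q_r^{\mathcal{I}^{(j)}} - p_r^{\mathcal{I}^{(j)}}\|_2^2.$$

Therefore,

$$\sum_{j=0}^{j_0 - 1} 2^{j/8} \, \|q_r^{\mathcal{I}^{(j)}} - p_r^{\mathcal{I}^{(j)}}\|_2^2 \ge 5 \cdot 10^{-9} \epsilon^2 / k. \qquad (5)$$

On the other hand, if $\|q_r^{\mathcal{I}^{(j)}} - p_r^{\mathcal{I}^{(j)}}\|_2^2$ were at most $10^{-10} \, 2^{-j/4} \epsilon^2 / k$ for each $j$, then the sum above would be at most

$$10^{-10} \, \frac{\epsilon^2}{k} \sum_j 2^{-j/8} < 5 \cdot 10^{-9} \epsilon^2 / k.$$

This would contradict Equation (5), thus proving that $\|q_r^{\mathcal{I}^{(j)}} - p_r^{\mathcal{I}^{(j)}}\|_2^2 \ge 10^{-10} \, 2^{-j/4} \epsilon^2 / k$ for at least one $j$, proving Lemma 21. □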
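The constants in this final step can be checked directly. The short Python sketch below (variable names are hypothetical) computes the coefficient of $\epsilon^2/k$ obtained by dividing $\epsilon^2/(2k)^{7/8}$ by $10^8 k^{1/8}$, the geometric sum $\sum_{j \ge 0} 2^{-j/8}$, and the threshold $\gamma_j^2 = \epsilon_j^2/\ell_j$ with $\epsilon_j = 10^{-5} \epsilon \, 2^{3j/8}$ and $\ell_j = k \cdot 2^j$, to confirm that it matches $10^{-10} \, 2^{-j/4} \epsilon^2/k$.

```python
# Coefficient of eps^2 / k on the right hand side of (5):
# eps^2 / (2k)^{7/8} divided by 10^8 * k^{1/8} equals eps^2 / (10^8 * 2^{7/8} * k).
lower_coeff = 1 / (10 ** 8 * 2 ** (7 / 8))

# Upper bound on the sum if every scale were below its threshold.
geometric_sum = 1 / (1 - 2 ** (-1 / 8))      # sum over j >= 0 of 2^{-j/8}
upper_coeff = 1e-10 * geometric_sum

def gamma_sq(eps, k, j):
    """gamma_j^2 = eps_j^2 / ell_j for eps_j = 1e-5 * eps * 2^{3j/8}, ell_j = k * 2^j."""
    return (1e-5 * eps * 2 ** (3 * j / 8)) ** 2 / (k * 2 ** j)

print(lower_coeff >= 5e-9, upper_coeff < 5e-9)
print(gamma_sq(0.1, 4, 3), 1e-10 * 2 ** (-3 / 4) * 0.1 ** 2 / 4)
```

Since the upper coefficient is strictly below $5 \cdot 10^{-9}$, the contradiction with (5) goes through with room to spare.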